
aws_ecs_task_definition and continuous delivery to ecs #632

Closed
hashibot opened this issue Jun 13, 2017 · 97 comments
Labels
enhancement: Requests to existing resources that expand the functionality or scope.
service/ecs: Issues and PRs that pertain to the ecs service.

Comments

@hashibot

This issue was originally opened by @dennari as hashicorp/terraform#13005. It was migrated here as part of the provider split. The original body of the issue is below.


With the task and container definition data sources I'm almost able to get our continuous delivery setup to play nicely with Terraform. We rebuild the docker image with a unique tag at every deployment. This means that after the CI service redeploys a service, the corresponding task definition's revision is incremented and the image field in a container definition changes.

I don't seem to be able to create a setup where the task definition could be managed by Terraform in this scenario.

Terraform Version

v0.9.1

Affected Resource(s)

  • resource aws_ecs_task_definition

Terraform Configuration Files

# Simply specify the family to find the latest ACTIVE revision in that family.
data "aws_ecs_task_definition" "mongo" {
  task_definition = "${aws_ecs_task_definition.mongo.family}"
}
data "aws_ecs_container_definition" "mongo" {
  task_definition = "${data.aws_ecs_task_definition.mongo.id}"
  container_name  = "mongodb"
}

resource "aws_ecs_cluster" "foo" {
  name = "foo"
}

resource "aws_ecs_task_definition" "mongo" {
  family = "mongodb"
  container_definitions = <<DEFINITION
[
  {
    "cpu": 128,
    "environment": [{
      "name": "SECRET",
      "value": "KEY"
    }],
    "essential": true,
    "image": "${aws_ecs_container_definition.mongo.image}",
    "memory": 128,
    "memoryReservation": 64,
    "name": "mongodb"
  }
]
DEFINITION
}

resource "aws_ecs_service" "mongo" {
  name          = "mongo"
  cluster       = "${aws_ecs_cluster.foo.id}"
  desired_count = 2
  # Track the latest ACTIVE revision
  task_definition = "${aws_ecs_task_definition.mongo.family}:${max("${aws_ecs_task_definition.mongo.revision}", "${data.aws_ecs_task_definition.mongo.revision}")}"
}

The problem is then that after a CI deployment, terraform would like to create a new task definition. The task definition resource here points to an earlier revision and the image field is considered changed.

With the deprecated template resources, I was able to ignore changes to variables, which solved this issue. One solution that comes to mind would be the ability to set the revision of the aws_ecs_task_definition resource.

I'd be grateful for any and all insights.

@hashibot hashibot added the bug Addresses a defect in current functionality. label Jun 13, 2017
@kurtwheeler

I have run into this issue as well. I think the solution I am going to go with is to not have the task definition be managed by terraform. Circle CI has a blog post about how to push a new task definition via a script they provide.

I agree that the ability to set the revision of the aws_ecs_task_definition would enable managing the task definition via Terraform. However, philosophically it does seem to break Terraform's model of having resource blocks correspond to resources within AWS. If that parameter were added, then a single aws_ecs_task_definition resource block would be responsible for creating multiple AWS resources.

@JDiPierro

JDiPierro commented Jun 28, 2017

I've gotten around this by using terraform taint as part of the deploy process (rough sketch below):

  • Push the container to the ECR
  • terraform taint the task-def
  • terraform apply makes a new revision and updates the service
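
A minimal sketch of that deploy step; the resource address, repo URL, and tag variables are illustrative, not from the original comment:

# Push the freshly built image
docker push "$ECR_REPO_URL:$BUILD_TAG"

# Mark the task definition for recreation on the next apply
terraform taint aws_ecs_task_definition.app

# Recreating the task definition registers a new revision; the service
# reference changes with it, so ECS rolls out the new tasks
terraform apply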

@naveenb29

@JDiPierro While using the taint solution, does it not kill the current task and replace it, rather than deploying the new task and then draining the old one?

@JDiPierro

@naveenb29 Nope. I believe that would be the case if you were tainting the ECS service. Since just the task-def is being recreated, the ECS service is updated, causing the new tasks to deploy. ECS waits for them to become healthy and then kills the old containers.

@tomelliff
Contributor

Our CI process tags the image as it's pushed to ECR and then passes that tag to the task definition. This automatically changes the task definition, so Terraform knows to recreate it, and that in turn is linked to the ECS service, causing the service to be updated with the new task definition.

I think I'm missing the issue that others are having here.

That said, I'd like Terraform to (optionally) wait for the deployment to complete: the new task definition has running tasks equal to the desired count and, potentially, the old task definitions have been deregistered so no new traffic will reach them. I can't see a nice way to get at that information using the API, though; the events don't really expose enough, so you'd probably have to describe the old task def, find the running tasks using it, find the ports they were running on, and check that the ALB they are registered to has all of those ports set as draining.

For now I'm simply shelling out and waiting until either the PRIMARY service deployment has a running count equal to the desired count (which doesn't catch the short window between PRIMARY tasks being registered and old tasks being deregistered), or the deployment list has a length of 1 (all old task definitions completely drained, which is overkill, since new connections won't reach them and the deployment can be considered complete before that).
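
A rough sketch of that second wait condition using only the AWS CLI; CLUSTER and SERVICE are placeholders:

# Block until only the PRIMARY deployment remains, i.e. all old task
# definitions have fully drained
while [ "$(aws ecs describe-services \
    --cluster "$CLUSTER" --services "$SERVICE" \
    --query 'length(services[0].deployments)' --output text)" -gt 1 ]; do
  echo "Waiting for deployment to complete..."
  sleep 10
done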

@radeksimko
Member

Hi folks,
see more detailed explanation of why this is happening at #13 (comment)

Labelling appropriately...

@radeksimko radeksimko added the upstream-terraform Addresses functionality related to the Terraform core binary. label Aug 31, 2017
@dmikalova

I was able to solve the inactive task definition issue with the example in the ECS task definition data source docs. You set up the ECS service resource to use the max revision of either what your Terraform resource has created or what is currently in AWS, which the data source retrieves.

The one downside to this is that if someone changes the task definition, Terraform will not realign it to what's defined in code.

@Esya

Esya commented Oct 24, 2017

What do you guys think about having a remote backend (it's S3 in my case) and having your CI pipeline create the new task definition and change the .tfstate file directly to match it?

For example, mine looks like this:

"aws_ecs_task_definition.backend_service": {
    "type": "aws_ecs_task_definition",
    "depends_on": [
        "data.template_file.task_definition"
    ],
    "primary": {
        "id": "backend-service",
        "attributes": {
            "arn": "arn:aws:ecs:eu-west-1:REDACTED:task-definition/backend-service:8", // This could be changed manually
            "container_definitions": "REDACTED",
            "family": "backend-service",
            "id": "backend-service",
            "network_mode": "",
            "placement_constraints.#": "0",
            "revision": "8", // This could be increased manually
            "task_role_arn": ""
        },
        "meta": {},
        "tainted": false
    },
    "deposed": [],
    "provider": ""
},

Couldn't we just change the arn and the revision, so that the next time Terraform runs, it still thinks it has the latest version of the task definition in its state?

@chriswhelix

I'm not sure I understand the problem y'all are trying to solve here. Why not just use terraform to create the new task definition in the first place, and then your tf state is always consistent? Our setup is similar to what @tomelliff describes.

@Esya

Esya commented Oct 25, 2017

@chriswhelix Well in my particular case, I have two separate repositories. One that holds the terraform project, and it creates my ECS cluster, my services, and the initial task definition.

The other one is for a specific service, and I'd like to have some CI/continuous delivery flow in place (Using gitlab pipelines in my case) to "containerize" the project, push it to ECR, and trigger a service update on my ECS cluster. (Edit: as a reminder, currently, if we use the aws cli to do this as part of our CI workflow, then the next terraform run will overwrite the task def.)

So, when you say "use terraform to create the new task definition in the first place", are you implying that on our CI system, when pushing our service's code, we should also clone our terraform repo, change the variable that holds the image tag for that service, run terraform apply, and commit + push to the TF repository?

tl;dr: Need a way to trigger service updates from any of our projects' build pipeline, without any user interaction with terraform.

@chriswhelix

@Esya what we do is that each project has in its build config the version of the terraform repo it is expecting to be deployed with. When the CI pipeline is ready to deploy, it pulls down the terraform repo using the git tag specified in the project build config, then runs terraform against that, providing the image tag it just wrote to the ECR repo as an input variable.

We don't write down the ECR image tag in the terraform repo; it must be provided each time terraform is run. So that avoids simple code updates to projects requiring any change to the terraform repo.
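
A hedged sketch of what that CI step might look like; the repo URL, tag, and variable name are illustrative:

# TF_REPO_TAG comes from the project's build config; IMAGE_TAG is the
# tag that was just pushed to ECR
git clone --branch "$TF_REPO_TAG" --depth 1 "$TF_REPO_URL" infra
cd infra
terraform init
terraform apply -var "image_tag=$IMAGE_TAG"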

@schmod

schmod commented Nov 1, 2017

I'm using ecs-deploy in my deployment pipeline, and a terraform config that looks something like this:

# Gets the CURRENT task definition from AWS, reflecting anything that's been deployed
# outside of Terraform (ie. CI builds).
data "aws_ecs_task_definition" "task" {
  task_definition = "${aws_ecs_task_definition.main.family}"
}

# "Dummy" application for initial deployment
data "aws_ecr_repository" "sample" {
  name = "sample"
}

# ECR Repo for our actual app
data "aws_ecr_repository" "main" {
  name = "${var.ecr_name}"
}

resource "aws_ecs_task_definition" "main" {
  family = "${var.name}"
  task_role_arn = "${module.iam_roles.ecs_service_deployment_role_arn}"

  container_definitions = <<DEFINITION
[
  {
    "name": "${var.name}",
    "image": "${data.aws_ecr_repository.sample.repository_url}:latest",
    "essential": true,
    "portMappings": [{
      "containerPort": ${var.container_port},
      "hostPort": 0
    }]
  }
]
DEFINITION
}

resource "aws_ecs_service" "main" {
  name = "${var.name}"
  cluster = "${var.cluster}"
  desired_count = 2
  task_definition = "${aws_ecs_task_definition.main.family}:${max("${aws_ecs_task_definition.main.revision}", "${data.aws_ecs_task_definition.task.revision}")}"
  iam_role = "${module.iam_roles.ecs_service_deployment_role_arn}"

}

During the initial deployment, Terraform deploys an "empty" container. When the CI pipeline runs, ecs-deploy creates a new task definition revision with the newly-built image/tag, and updates the service accordingly.

Terraform recognizes these new deployments via data.aws_ecs_task_definition.task, and doesn't attempt to overwrite them.

HOWEVER, if other parts of the task definition change, Terraform will redeploy the sample application, as it'll try to create a new revision of the task definition (using the config containing the sample application). Hypothetically, data.aws_ecs_container_definition could be used to pull the image of the currently-active task definition. However, I haven't been able to figure out a way to use this that doesn't create a circular dependency or result in a chicken/egg problem during the initial deployment (ie. the data source is looking for a task definition that hasn't been created yet):

data "aws_ecs_task_definition" "task" {
  task_definition = "${aws_ecs_task_definition.main.family}"
}

data "aws_ecs_container_definition" "task" {
  task_definition = "${data.aws_ecs_task_definition.task.id}"
  container_name  = "${var.name}"
}

resource "aws_ecs_task_definition" "main" {
  family = "${var.name}"
  task_role_arn = "${module.iam_roles.ecs_service_deployment_role_arn}"

  container_definitions = <<DEFINITION
[
  {
    "name": "${var.name}",
    "image": "${data.aws_ecs_container_definition.task.image}",
  }
]
DEFINITION
}

This creates a cycle, and won't work during the initial deployment.

This is very close to my ideal setup. If Terraform somehow supported a "get or create" data/resource hybrid, I'd be able to do almost exactly what I'm looking for.

@chriswhelix

@schmod you could possibly use a var to create a special bootstrapping mode, i.e. "count = var.bootstrapping ? 0 : 1" to turn on/off the data sources, and coalesce(data.aws_ecs_container_definition.task.*.image, "fake_image") on the task def.
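
Sketched out in the 0.11-era syntax of the configs above. The bootstrapping variable comes from the comment itself; referencing the family by its plain name (var.name) rather than through the resource is an added assumption that also keeps the data sources out of the dependency cycle:

variable "bootstrapping" {
  default = false
}

data "aws_ecs_task_definition" "task" {
  count           = "${var.bootstrapping ? 0 : 1}"
  task_definition = "${var.name}"
}

data "aws_ecs_container_definition" "task" {
  count           = "${var.bootstrapping ? 0 : 1}"
  task_definition = "${data.aws_ecs_task_definition.task.0.id}"
  container_name  = "${var.name}"
}

# In the task definition, fall back to a placeholder image while bootstrapping:
#   "image": "${coalesce(join("", data.aws_ecs_container_definition.task.*.image), "sample:latest")}"

Pass -var bootstrapping=true on the very first apply, before any task definition exists, and leave it false afterwards.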

I feel like if you're going to manage a particular resource with terraform, it's really best to make all modifications to it using terraform, though. If you solve this issue for container images, you're just going to have it again for service scaling, and again for environment changes, and again for anything else ecs-deploy does behind terraform's back.

What we really need is good deployment tools that work with terraform instead of around it.

@schmod

schmod commented Nov 1, 2017

Given the scope of what Terraform is allowed to do to my AWS resources, I'm rather apprehensive about running it in an automated/unmonitored environment. On the other hand, I can control exactly what ecs-deploy is going to do.

Infrastructure deployments and application deployments are very different in my mind. There's a fairly large and mature ecosystem around the latter, and I don't think that Terraform should need to reinvent that wheel. It should merely provide a configuration interface to tell it the exact set of changes that I expect those external tools to make.

We already have a version of that in the form of the ignore_changes lifecycle hook. My problem could also be solved if container_definition were supported as a first-class citizen (similar to aws_iam_policy_document), allowing something like ignore_changes=["container_definition.image"].
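
For contrast, the coarse-grained form that works today ignores the entire attribute, which also hides the cpu/memory/environment changes Terraform should keep managing. A sketch, with the template_file data source assumed:

resource "aws_ecs_task_definition" "main" {
  family                = "${var.name}"
  container_definitions = "${data.template_file.task_definition.rendered}"

  lifecycle {
    # All-or-nothing: this also swallows changes we *do* want Terraform to apply
    ignore_changes = ["container_definitions"]
  }
}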

@chriswhelix

@schmod isn't the real issue what your build agent is permissioned to do? If your build agent has least privileges for the changes you actually want it to make, shouldn't matter which tool makes them.

I agree that the interface between terraform and existing deployment tools seems like a generally awkward area. We've dealt with that mostly by just writing our own deployment scripts, in conjunction with what ECS provides out of the box. I'm not sure it's a problem that's solvable solely by changes to terraform, though; in this case, the fundamental problem is that there's no clean divide between the "infrastructure" part of ECS and the "application" part of ECS. That's really Amazon's fault, not terraform's.

There is a clean boundary at the cluster level -- i.e. it would be easy to have terraform manage all the backing instances for an ECS cluster, and another tool manage all the services and tasks running on the cluster. If your basic philosophy is a strong divide between "infrastructure" and "applications", it seems like drawing that line right through the middle of a task definition creates much too complicated a boundary to easily manage.

@schmod

schmod commented Nov 6, 2017

Right. The problem is that (in my use-case, and probably most others) an application deployment should change exactly one parameter on the task definition (image).

It's difficult to draw a line around the task definition, however, because it contains a lot of other configuration that I'd really prefer to remain static (and managed by Terraform). This makes it unattractive to draw a clean boundary at the cluster level (and also leaves both your Service and Task Definition completely unmanaged by Terraform).

As I mentioned earlier, the ignore_changes flag has been used elsewhere to help accommodate similar use-cases, and there's probably room to build out support for that in a way that shouldn't require fundamentally changing how Terraform works.

@dev-head

dev-head commented Dec 1, 2017

We share the same use case most people are reporting here.

Our deployments are uniquely tagged, which requires a new task definition to update the ECS service on each deployment. This happens outside of Terraform's control for a variety of reasons that aren't important to the issue at hand.

It seems we need the ability to ignore changes on the aws_ecs_service resource; we can't do that right now because TF doesn't support interpolations in lifecycle blocks and this resource is part of a shared module (see the sketch below).
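
A static lifecycle block does work inside a shared module, though it then applies to every instance of the module unconditionally, which is the limitation being described. A sketch with assumed variable names:

resource "aws_ecs_service" "app" {
  name            = "${var.name}"
  cluster         = "${var.cluster_id}"
  desired_count   = "${var.desired_count}"
  task_definition = "${aws_ecs_task_definition.app.arn}"

  lifecycle {
    # Let the CI pipeline repoint the service at new revisions
    ignore_changes = ["task_definition"]
  }
}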

@damscott

damscott commented May 10, 2018

I worked around this by using a bash script in an External Data Source to return the current image for the container definition. If the script gets an error looking up the task definition, it assumes this is the initial infrastructure deployment and uses a default value.

resource "aws_ecs_task_definition" "task" {
  family = "${var.app}-${var.env}"
  task_role_arn = "${aws_iam_role.app_role.arn}"
  container_definitions = <<JSON
[
  {
    "name": "${var.app}",
    "image": "${aws_ecr_repository.app_repo.repository_url}:${data.external.current_image.result["image_tag"]}"
  }
]
JSON
}

data "external" "current_image" {
  program = ["bash", "${path.module}/ecs-get-image.sh"]
  query = {
    app = "${var.app}"
    cluster = "${var.cluster_id}"
  }
}

ecs-get-image.sh:

#!/bin/bash

# This script retrieves the container image running in the current <app>-<env>
# If it can't get the image tag from AWS, assume this is the initial
# infrastructure deployment and default to "latest"

# Exit if any of the intermediate steps fail
set -e

# Get parameters from stdin
eval "$(jq -r '@sh "app=\(.app) cluster=\(.cluster)"')"

taskDefinitionID="$(aws ecs describe-services --services "$app" --cluster "$cluster" | jq -r '.services[0].taskDefinition')"

# Default to "latest" if the task definition doesn't exist yet
if [[ -n "$taskDefinitionID" && "$taskDefinitionID" != "null" ]]; then
  taskDefinition="$(aws ecs describe-task-definition --task-definition "$taskDefinitionID")"
  containerImage="$(echo "$taskDefinition" | jq -r '.taskDefinition.containerDefinitions[0].image')"
  imageTag="$(echo "$containerImage" | awk -F':' '{print $2}')"
else
  imageTag="latest"
fi

# Generate a JSON object containing the image tag
jq -n --arg imageTag "$imageTag" '{"image_tag":$imageTag}'

exit 0

It triggers a new task definition in Terraform when anything in the container_definition besides the image changes, so we can still manage memory, cpu, etc. from Terraform, and it plays nicely with our CI (Jenkins), which pushes new images to ECR and creates new task definitions pointing to those images.

It may need some reworking to support running multiple containers in a single task.

Edit:

If you are using the same image tag for every deployment (e.g. "latest", "stage"), this will revert to whatever task definition is in the state file. It doesn't break anything, but it is confusing. A workaround is to create an external data source, similar to this one, that returns the current task definition running in AWS to the aws_ecs_service if the image tag hasn't changed.

Edit 2:
I updated the script and tf file to also return the task definition revision number. This lets us use a ternary on aws_ecs_service.task_definition to always use the most current revision, eliminating the issue where it rolled back the task definition if you always use the same image tag. I put the updated code in a gist:
https://gist.github.com/damscott/9da8f2e623cac61423bb6a05839b10a9

This still does not support multiple containers in a single task definition.

I also want to say thanks to endofcake; I looked at your python version and took a stab at rewriting my code in python. I learned a lot, but ultimately stuck with bash because it's less likely to introduce dependency issues.

@endofcake
Contributor

endofcake commented May 22, 2018

I've also used an external data source as a workaround. The main difference is that it's written in Python, supports multiple containers in the task definition, and does not fall back to latest (it's the responsibility of Terraform).

The script is here:
https://gist.github.com/endofcake/4ea2ac5c030a37965b65c7591c83a047

Here's a snippet of Terraform configuration that uses it:

data "external" "active_image_versions" {
  program = ["python", "/scripts/get_image_tags.py"]

  query = {
    cluster_name = "${data.terraform_remote_state.ecs.ecs_cluster_id}"
    service_name = "${var.app_name}"
  }
}

<...>
data "template_file" "task_definition" {
  template = "${file("task_definitions/sample.tpl")}"

  vars {
    # lookup the image in the external data source output and default to 'latest' if not found
    app_image              = "${aws_ecr_repository.sample.repository_url}:${lookup(data.external.active_image_versions.result, var.app_name, "latest")}"
    proxy_image            = "${aws_ecr_repository.sample.repository_url}:${lookup(data.external.active_image_versions.result, var.proxy_name, "latest")}"
  }
}

This solved the problem of Terraform trying to reset the image in the task definition to the one it knew about. However, after an app deployment that happens outside of Terraform, it still detects changes in the task definition the next time it runs. It then creates a new task revision, which triggers a bounce of the ECS service, essentially a redeployment of the same image version. I've found no way to prevent it from doing this so far.

@endofcake
Contributor

After some investigation it looks to me like the problem is caused by Terraform not knowing about the new task revision.

Say the last revision it knows about is 36. This is the revision stored in its remote state, and it's also the revision used by the ECS service as far as Terraform is concerned. Meanwhile, the currently active revision is 38, and it uses a new Docker image. With workarounds like the above, Terraform is able to grab this image version by describing the current ECS task definition, but it then tries to create a new task revision with it, which in turn triggers a redeployment of the ECS service.

This lack of clear separation between infrastructure and application deployments turns out rather problematic, and I'm not sure how we can work around that.

@mboudreau

How has this not been resolved yet? ECS has been around for a while and CI deployments outside of terraform seems like the standard operating procedure, and yet here I am still trying to get this new deployment working...

@endofcake
Contributor

See also this approach, which looks more promising:
#3485

@codergolem

Hi everyone,

Sorry, but I am struggling to understand the problem most people are having, namely: why do you want to avoid a new task definition revision being created when the image has changed? Isn't that the standard way of deploying a new image to ECS, or how are you doing it otherwise?

@endofcake
Contributor

endofcake commented Jun 5, 2018

@codergolem, it's not about avoiding the new task definition; it's about making Terraform play nicely with changes that happen outside of Terraform, namely application deployments in ECS. Terraform is an infrastructure management tool and just doesn't cut it when it comes to application deployments, even if we bolt wait-conditions onto it. This really looks more like a problem with the AWS API than with Terraform, but so far I see no way to resolve this impedance mismatch cleanly.

@mboudreau

@codergolem To put @endofcake's reply into context, let me give our example:

  • We have terraform build an ECS cluster, ECR, security groups, roles, services, tasks, etc. This is to make sure our stack is solid and reproducible between dev/prod and between our different regions.
  • After the cluster/services are created, we then use a CI server (in this case TravisCI) to build our docker image, push it to ECR, then use the AWS CLI to update the task definition to use the latest built image that we just pushed to ECR.

We do it this way for several reasons:

  1. The code doesn't really have to know much about the infrastructure, just the ECR name, the cluster name and the task to update.
  2. The terraform template doesn't have to care about the code or which version is considered the "latest" in the ECR docker image
  3. It's simpler to have a single process deploy a new version, instead of one process creating the new version and another updating the task to it.
  4. Security: to deploy in a single process using Terraform, the AWS user on TravisCI (or any CI server) would need fairly open permissions, which is a massive security vector. I'd much rather leave the terraform apply step on a developer's computer, where human interaction is required for such high privileges, and give the CI server a user with very limited permissions (i.e. ECR push, task update).

Because of these reasons, it's very difficult to use Terraform with a CI server when you want to specify the task definition structure within Terraform, which I would argue is needed, since the task definition needs references to the role created for the task and to any other parts of the infrastructure it uses.

@adamlc

adamlc commented Oct 27, 2020

@blytheaw I'm wondering if you could use a null_resource with a provisioner to trigger the changes somehow?

I have a similar problem, just trying to find the best solution. We use GitLab so I'm going to try to see if I can get the null_resource to trigger a GitLab pipeline when something changes by using a provisioner with curl.

@bilbof

bilbof commented Oct 29, 2020

We also hit this problem.

I agree with what others have said: ECS could make it easier to draw a line between infrastructure and application deployments. It has led to us doing something quite idiosyncratic where we'd prefer to follow convention.

I think HashiCorp's answer to this may be the recently announced Waypoint. Right now Waypoint doesn't integrate enough with Terraform to meet our needs (i.e. task definitions reference secrets and resources created by Terraform).

There are a few ways we explored to fix this:

a) Deploy with Terraform, use var=app_image_tag=release_123 when applying to release specific image
b) Manage task definitions outside of Terraform, use the AWS CLI to update an ECS service created by Terraform (with ignore_changes set on the Service resource task_definition attribute) to use the newly minted task definition during deploys
c) Create task definitions and services in Terraform during bootstrapping, but manage task definition revisions in a separate statefile (using the same task def modules) to enable more involved / custom deploys.

We've gone for option C for now.

This gives us a single source of truth for config (apart from docker image_tags, which are passed in as -var args). This way we can have all infrastructure, including task definitions, defined in Terraform (having cake), and also update the ECS Service to use the new task definitions using the AWS CLI (eating cake too).

From an operator perspective the workflow looks like this:

# updating infra
cd environment && terraform apply -var=app_a_image_tag=tag_of_thing_already_in_env ...

# deploying app
cd apps/app_a
terraform apply -var=app_a_image_tag=release_123
task_definition_arn=$(terraform output task_definition_arn)
aws ecs update-service --cluster my-cluster --service app_a \
--task-definition $task_definition_arn --region $AWS_REGION

To add another benefit of deploying apps separately from infrastructure updates: deploy hooks. Sometimes you want to run a task using the same task definition you use for apps, but for a different service (e.g. db migrations; pseudocommand: aws ecs run-task --task-def=my-task --cmd=rails db:migrate --service=migrator). Things like Capistrano give you the flexibility to do that kind of thing, and this approach gives you something similar (e.g. 1) create task def, 2) run tasks using def, 3) update service); see the sketch below.
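
That pseudocommand maps onto the real CLI roughly as follows; the cluster, container name, and command are illustrative:

# One-off task reusing the app's task definition, with a command override
aws ecs run-task \
  --cluster my-cluster \
  --task-definition "$task_definition_arn" \
  --overrides '{"containerOverrides": [{"name": "app", "command": ["rails", "db:migrate"]}]}' \
  --region "$AWS_REGION"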

@sp3c1

sp3c1 commented Feb 19, 2021

@bilbof so, when you have 40 services in this cluster, do you do this manually for every service? That is a bit of nonsense.

@bilbof

bilbof commented Feb 19, 2021

@sp3c1 thanks for your feedback. The procedure I described above is automated using a Concourse pipeline; releases are continuously deployed without manual intervention. Since you mention it… our design has changed to the following procedure, which gives us more control over release deployments (this is automated):

a) ECS Services and other resources are managed by Terraform
b) Task definition revisions are pushed using the AWS CLI

I’d be keen to improve on this. If there is a generally accepted pattern for continuously deploying to ~100 ECS Services managed with Terraform, I'd like to adopt it.

@schmod

schmod commented Feb 20, 2021

I’d be keen to improve on this. If there is a generally accepted pattern for continuously deploying to ~100 ECS Services managed with Terraform, I'd like to adopt it.

After 4 years of following this issue, the consensus seems to be that no, there is not. You're (more or less) following the same compromise that most folks have landed on.

@maartenvanderhoef

For a little while I managed one of the larger Terraform modules for ECS. Since EKS came, many issues stayed unresolved, and it became clear that AWS is leaving ECS as it is. A big shame, because ECS is the 'exoskeleton' way of managing Docker services, and from the beginning it was very close to perfection. As with many AWS products (ElasticBeanstalk, Amplify), they get to a level where they work well enough, but AWS never gives them the final paint job.

I haven't touched ECS in recent years, but my belief now is to really integrate the creation and updating of services into the CI/CD itself. SSM could be used with Terraform to centrally orchestrate settings like memory consumption, which can later be consumed by CI/CD, et cetera. This, or use Terraform entirely to build up CodePipeline/CodeBuild and control the ECS services' configuration by managing their CI/CD layer.

@zen0wu

zen0wu commented Jun 28, 2021

After reading through the thread and much thinking, I decided to take the following approach.

  • Have an application script to generate the task definition as the single source of truth, because our application uses TS, it's more naturally integrated with JSON and it's strongly typed (while TF template is not great for either)
  • Have a null_resource executing the script to "ensure the task definition exists" in ECS
  • Have a data source that selects the most recent active task def (deployment typically uses the most recent active revision, at least in our case); this depends on the null_resource:

resource "null_resource" "task_definition_generator" {
  triggers = {
    family  = var.family
    command = var.generator_command
  }

  provisioner "local-exec" {
    command     = var.generator_command
    working_dir = local.root_dir
  }
}

data "aws_ecs_task_definition" "task_def" {
  depends_on = [null_resource.task_definition_generator]

  # This pulls in the latest ACTIVE revision. It might or might not be
  # the one created by the generator_command, but that's generally ok.
  # We're just assuming the latest version is always working.
  task_definition = var.family
}
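
The service side of this (not shown in the comment) would then track whatever the data source resolves to. An illustrative sketch with assumed inputs:

resource "aws_ecs_service" "app" {
  name          = var.family
  cluster       = var.cluster_id # assumed input
  desired_count = var.desired_count

  # Always follow the latest ACTIVE revision resolved above
  task_definition = "${var.family}:${data.aws_ecs_task_definition.task_def.revision}"
}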

@WhyNotHugo

@zen0wu How do you apply updates to task definitions? E.g. changing the memoryReservation, environmentFiles, or other attributes?

@zen0wu

zen0wu commented Jul 10, 2021

@WhyNotHugo all the actual task definition content belongs to our TS code, so if we want to update it, we just trigger a deploy: first call the same generator command (to create the new task definition) and then call UpdateService.

TF only manages limited info: which load balancer to use, how many containers in total, essential things that belong only to the service.

@WhyNotHugo

Task definitions reference a few terraform-managed resources in my case (log group, environment variables, SSM ARNs, and IAM ARNs).

How do you get those values from terraform into the deploy process?

@zen0wu

zen0wu commented Jul 12, 2021

@WhyNotHugo good question - ideally we'd pass those in as command-line arguments to the generator command when Terraform calls it, since TF has those values. But for the continuous delivery part (TypeScript in our case, running alone for the deploy), you'd have to either (a) pull the existing values out of the current task definition, (b) get them from Terraform (by running terraform console), or (c) hard-code them (this is what we did :p, since we only need taskRoleArn).

@bholzer

bholzer commented Jul 12, 2021

@WhyNotHugo

Task definitions have a few terraform-managed resources in my case (group log, environment variables, ssm arns, and IAM arns).

How do you get those values from terraform into the deploy process?

First make an API call to describe the existing task definition, which contains those fields. Then modify just the image (or any other fields you want) in that response, and pass it into a task definition registration call. How I handled this:

#!/usr/bin/env bash
set -euo pipefail

task_def_family_name=${1:-}
image=${2:-}
container_name=${3:-}

if [[ -z "$task_def_family_name" ]] ; then
    echo 1>&2 "error: A task definition family name is required. May also include a revision (familyName:revision)"
    exit 1
fi

if [[ -z "$image" ]] ; then
    echo 1>&2 "error: An image is required"
    exit 1
fi

# Format the response in a way that is easily usable by register-task-definition
latest_task_definition=$( \
    aws ecs describe-task-definition \
        --include TAGS \
        --task-definition "$task_def_family_name" \
        --query '{  containerDefinitions: taskDefinition.containerDefinitions,
                    family: taskDefinition.family,
                    taskRoleArn: taskDefinition.taskRoleArn,
                    executionRoleArn: taskDefinition.executionRoleArn,
                    networkMode: taskDefinition.networkMode,
                    volumes: taskDefinition.volumes,
                    placementConstraints: taskDefinition.placementConstraints,
                    requiresCompatibilities: taskDefinition.requiresCompatibilities,
                    cpu: taskDefinition.cpu,
                    memory: taskDefinition.memory,
                    tags: tags}' \
)

container_count=$(jq -r '.containerDefinitions | length' <<< "$latest_task_definition")

if [[ "$container_count" -gt 1 ]] && [[ -z "$container_name" ]] ; then
    echo 1>&2 "error: The task definition has more than one container definition, you must choose one."
    exit 1
fi

# If there's only one container in the task definition, update its image, otherwise look by container name
# We should never make it to the `else` here, but we just create a duplicate revision if we do.
new_task_definition=$(echo "$latest_task_definition" \
    | jq -rc --arg containerName "$container_name" --arg newImage "$image" \
        '.containerDefinitions |= (
            if ( . | length ) == 1 then
                .[].image = $newImage
            elif ($containerName | length) > 0 then
                map(select(.name == $containerName).image = $newImage)
            else
                .
            end
        )'\
)

registration_response=$(aws ecs register-task-definition --cli-input-json "$new_task_definition")

new_revision=$(jq '.taskDefinition.revision' <<< "$registration_response")
old_revision=$((new_revision-1))

deregistration_response=$(aws ecs deregister-task-definition --task-definition "$task_def_family_name:$old_revision")

@anGie44
Contributor

anGie44 commented Jul 26, 2021

Hi @dennari and all those following this issue 👋. Thank you again for submitting and providing feedback on this issue. As noted by others in the comments above, because Terraform expects full management of the ECS Task Definition resource and the upstream ECS API does not support methods to appropriately manage individual revisions without replacement, we cannot provide any patches to the resource's current behavior in the provider, and thus will be closing this issue.

Patches that suggest setting or manipulating the state difference of an ECS Task Definition's revision would imply having one resource responsible for multiple AWS resources, and this would prove problematic to the traditional practitioner experience with the rest of the Terraform provider ecosystem. With that said, we want to note that the ECS API does not allow for a great user experience when using Terraform alongside separate tooling like ecs-deploy, and we thus encourage practitioners to proceed with caution when integrating both in a CI/CD pipeline.

@anGie44 anGie44 closed this as completed Jul 26, 2021
@WhyNotHugo

WhyNotHugo commented Jul 26, 2021

@bholzer My issue with that approach is that the next time terraform runs, it detects changes in the task definitions, and tries to recreate them.

I used to have a hack to work around this: a data aws_ecs_container_definition which found the latest version and re-used the image defined there. However, this had two issues:

  • Each time there's a deployment, terraform will see the dirty state and update the task definitions. The newly created one matches the latest anyway, but it adds a lot of diff noise each time terraform is run, especially with a lot of task definitions.
  • Creating a new task definition is a pain due to the circular dependency between the data and resource.

I finally found a solution that really works from all angles. I create a "template" task definition in terraform, which is fully terraform-managed and never altered outside of terraform:

resource "aws_ecs_task_definition" "django_template" {
  for_each = local.full_websites

  family = "django-${each.key}-template"
  container_definitions = jsonencode([{
    name              = "django"
    command           = ["/app/scripts/run-django"]
    essential         = true
    image             = "whynothugo/sleep"
    memoryReservation = 600
    portMappings      = [{ containerPort = 8000, protocol = "tcp" }]
    user              = "django"

    # Zeep's cache fails with this on:
    # readonlyRootFilesystem = true
    linuxParameters = { tmpfs = [{ containerPath = "/tmp", size = 100 }] }

    environmentFiles = [
      {
        value = local.envfile_arns[each.key]
        type  = "s3"
      }
    ]

    healthCheck = {
      command  = ["/app/scripts/healthcheck-django"]
      interval = 30
      retries  = 3
      timeout  = 5
    }

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        awslogs-group         = aws_cloudwatch_log_group.django[each.key].name
        awslogs-region        = "us-west-2"
        awslogs-stream-prefix = "ecs"
      }
    }

  }])

  task_role_arn      = aws_iam_role.production_task_role.arn
  execution_role_arn = aws_iam_role.production_task_execution_role.arn
  network_mode       = "bridge"

  placement_constraints {
    type       = "memberOf"
    expression = "attribute:Role == Web"
  }

  requires_compatibilities = ["EC2"]

  tags = {
    Component    = "Django",
    Environment  = "Production"
    BaseImageUrl = aws_ecr_repository.production.repository_url
  }
}

These don't have the right image though. My deployment pipeline will find the task definition (the one with -template), replace the image, and save it under a new name (without -template). My ECS services point to the non-template ones.

The base image URL for each service is specified in the tags, so the deployment script merely appends the desired tag to it. This also implies that tags change on each deployment, allowing automatic rollbacks to work.

Finally, my services initially point to the "template" task definition, but include:

  lifecycle {
    ignore_changes = [task_definition, load_balancer]
  }

This means that after the first deploy, ECS replaces the task definition for the service, and terraform never touches that again. Ever.

I've been using this setup and found it works really well. Deployments never result in any noise in terraform plans, and terraform itself FULLY manages the template, while deployments operate on a separate task definition.
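
Putting the pieces together, the service side of the pattern looks roughly like this; the cluster reference and desired_count are assumed, not from the original comment:

resource "aws_ecs_service" "django" {
  for_each = local.full_websites

  name          = "django-${each.key}"
  cluster       = aws_ecs_cluster.main.id # assumed cluster resource
  desired_count = 2

  # Points at the template only until the first pipeline deploy;
  # ignore_changes keeps Terraform's hands off it afterwards
  task_definition = aws_ecs_task_definition.django_template[each.key].arn

  lifecycle {
    ignore_changes = [task_definition, load_balancer]
  }
}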

@WhyNotHugo

@anGie44 I understand that terraform won't make changes to support this odd case. Do you mind not locking this issue, so workarounds can continue to be discussed in the comments here (including, probably, by others in the same situation in the future)?

@Sytten

Sytten commented Jul 26, 2021

@anGie44 Terraform could still provide some help to allow for a better user experience.
There is currently no good way to tell it: check if this resource exists, and create it otherwise. Some people have proposed workarounds, but they are not perfect and can create loops.
Since this is a very common use case, I would expect something to be done about it, if not by the official aws provider then by some community-driven provider. A "won't fix" is not helpful...

Looking at what I am doing with GCP Cloud Run, I can quite easily use the provider to change the memory size without changing the image that is used, since that is managed by the CI/CD (like ecs-deploy). It is just a matter of treating the task definition as one whole piece of state (modified using revisions) without pinning it to one revision. That way you can ignore changes in, say, image, and still detect changes in memory/cpu.

@sworisbreathing
Contributor

sworisbreathing commented Jul 27, 2021

@anGie44 thank you for clearly explaining the team's position on this issue. I agree, ECS doesn't do a great job of assisting the developer experience here, due to the fact that task definitions are immutable and versioned.

I'd had it in the back of my mind for a while now to roll my own provider while #11506 was being ignored, but other work took priority and I was able to get by with manual hacks. Since this basically confirms the PR won't be accepted, it seems the community is left with little choice but to use a custom provider in order to make ECS work properly.

To echo @WhyNotHugo's request, please don't lock this issue so the user community can continue to discuss workarounds.

@sworisbreathing
Contributor

sworisbreathing commented Jul 27, 2021

also @anGie44:

having one resource responsible for multiple AWS resources... would prove problematic to traditional practitioner experience with the rest of the Terraform provider ecosystem

Isn't that exactly what the aws_security_group resource does when you use the ingress{} and egress{} rule blocks, though?

@WhyNotHugo

Since this basically confirms the PR won't be accepted, it seems the community is left with little choice but to use a custom provider in order to make ECS work properly.

The "template" task definition approach works, and doesn't violate any of the principles that terraform or ECS follow. The biggest downside is that listing task definitions yields more results, so if you interact manually with ECS a lot, then that might be annoying.

@schmod

schmod commented Jul 29, 2021

Patches that suggest setting or manipulating the state difference of an ECS Task Definition’s’ revision would imply having one resource responsible for multiple AWS resources

I'm kind of surprised that you're viewing task revisions as distinct resources. Surely there's precedent in the Terraform ecosystem for managing resources that maintain immutable version-histories?

@WhyNotHugo

Surely there's precedent in the Terraform ecosystem for managing resources that maintain immutable version-histories?

Lambdas.

I'm kind of surprised that you're viewing task revisions as distinct resources.

There are two "dimensions" in which I make changes to task definitions:

  • During deployments, via CodeDeploy. These change only the image_url and version.
  • During infra changes, via Terraform. These change anything except those two.

Having this concept of "task definition template" and "task definition" means that Terraform owns one of these, and CodeDeploy owns the other.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 29, 2021