ECS Milestone 2 Feedback #5

Open
mcasperson opened this issue Aug 2, 2021 · 17 comments

@mcasperson
Contributor

mcasperson commented Aug 2, 2021

Please add a comment to this issue with your feedback regarding the proposed ECS integration described here.

This milestone builds on the work from milestone 1.

We have a few general areas that we are actively seeking feedback on, listed below. But any feedback is welcome.

  • Will the ability to update an existing task definition and service support your existing ECS clusters?
  • What further ECS deployment challenges do you wish Octopus could solve for you?
  • Can you foresee any challenges that may stop you from using the proposed step with your existing ECS clusters?
@PeteGoo

PeteGoo commented Aug 11, 2021

Will the ability to update an existing task definition and service support your existing ECS clusters?

Not really. We took a different approach (a rough sketch of the flow follows the list):

  • Terraform ignores any changes to task definitions so the deployment doesn't clash with terraform apply
  • Task Definition json files are packaged into a feed and treated like a normal package dependency
  • The ECR repository is also a feed
  • We have a custom step template that takes the above two dependencies and will replace the Octopus variables in the task-definition.json with the full image name:tag
  • The step first assumes a role into the target AWS account
  • The step creates a new revision of the task definition
  • Applies the new revision to the service
  • Monitors the new deployment and waits until all old tasks have been removed and the deployment has reached a "steady state"
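
For illustration, the core of that flow boils down to something like the following AWS CLI sketch (the cluster, service and file names are placeholders, and the real step template also covers the variable substitution and role assumption described above):

# Sketch only: swap the image into the packaged task definition, register it as a
# new revision, point the service at it, and wait for a steady state.
IMAGE="123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.2.3"

# Replace the image in the packaged task-definition.json
jq --arg image "$IMAGE" '.containerDefinitions[0].image = $image' \
  task-definition.json > task-definition.resolved.json

# Register the new revision and capture its ARN
TASK_DEF_ARN=$(aws ecs register-task-definition \
  --cli-input-json file://task-definition.resolved.json \
  --query 'taskDefinition.taskDefinitionArn' --output text)

# Point the service at the new revision
aws ecs update-service --cluster my-cluster --service my-service \
  --task-definition "$TASK_DEF_ARN"

# Block until old tasks have drained and the deployment reaches a steady state
aws ecs wait services-stable --cluster my-cluster --services my-service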

What further ECS deployment challenges do you wish Octopus could solve for you?

We also solved for the rolling of cluster instances using a custom step to double instances, drain and terminate old instances when complete.

Can you foresee any challenges that may stop you from using the proposed step with your existing ECS clusters?

  • Developers are not in control of the task definition so changes to environment variables become an awkward two stage process.
  • How can you change the load balancer as part of a deployment?

(sorry, I missed the opportunity to give feedback on milestone 1)

@mcasperson
Contributor Author

mcasperson commented Aug 11, 2021

Thanks for that feedback @PeteGoo. You have raised a number of good points.

The process you have outlined sounds similar to what we are proposing. I’ll add comments to each point to highlight where we differ. I’d love to know if the points of difference would prevent you from using the new step if you (or others that have already implemented a workflow like yours) wanted to migrate:

Terraform ignores any changes to task definitions so the deployment doesn't clash with terraform apply

Although it wasn’t called out in the RFC, we do foresee that anyone using tools like Terraform would use ignore_changes to gracefully handle external modification of the task definition and service. I expect we’ll have to document this more explicitly though.

Task Definition json files are packaged into a feed and treated like a normal package dependency

This is one point where the proposed step differs. Because we don’t know how the task definition was originally created (Terraform, manually through the AWS console, etc.), the proposed step will copy the latest task definition and create a new revision with the image versions updated via SDK calls.
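
For illustration only, the rough CLI equivalent of those SDK calls is something like this (the task definition family and image are placeholders):

# Sketch: copy the latest revision, bump the image, and register the copy as a
# new revision of the same family.
FAMILY="mytaskdef"
NEW_IMAGE="myregistry/myapp:2.0.0"

# Fetch the latest revision and strip the read-only fields that
# register-task-definition will not accept
aws ecs describe-task-definition --task-definition "$FAMILY" \
  --query 'taskDefinition' > latest.json

jq --arg image "$NEW_IMAGE" \
  'del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)
   | .containerDefinitions[0].image = $image' latest.json > new-revision.json

aws ecs register-task-definition --cli-input-json file://new-revision.json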

The ECR repository is also a feed

This will be the same for the proposed step.

We have a custom step template that takes the above two dependencies and will replace the Octopus variables in the task-definition.json with the full image name:tag

We’ll differ here by copying the latest task revision and updating the image with an SDK call.

The step first assumes a role into the target AWS account

This is a good point. Existing AWS steps in Octopus allow roles to be assumed, and we’ll need to expose this functionality in the new ECS targets.
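
For anyone doing this in a script step today, the role assumption itself is only a couple of calls (the role ARN and session name below are placeholders):

# Sketch: assume a role in the target account and export the temporary
# credentials for the AWS CLI calls that follow.
CREDS=$(aws sts assume-role \
  --role-arn arn:aws:iam::123456789012:role/deployment-role \
  --role-session-name octopus-ecs-deploy \
  --query 'Credentials' --output json)

export AWS_ACCESS_KEY_ID=$(echo "$CREDS" | jq -r '.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$CREDS" | jq -r '.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$CREDS" | jq -r '.SessionToken')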

The step creates a new revision of the task definition

The proposed step will also create a new revision of the task definition, but via SDK calls.

Applies the new revision to the service

Like above, the proposed step will do this via SDK calls.

Monitors the new deployment and waits until all old tasks have been removed and the deployment has reached a "steady state"

This is a good point. We haven’t explicitly provided any options to wait for the deployment to complete.

Apart from assuming roles and waiting for the deployment to complete, it sounds like the proposed step will differ from the process you outlined by using SDK calls to copy a task revision, update it with new image versions, and then update the service, instead of using Terraform.

If the original Terraform template was set to ignore changes to task definition image versions and service task definition revisions, would the proposed step allow you to create your infrastructure with Terraform and update it with Octopus?

We also solved for the rolling of cluster instances using a custom step to double instances, drain and terminate old instances when complete.

We have put some thought into how these advanced deployments might work (although that functionality has been called out as out of scope for milestone 2). Our current thinking is to use an external deployment controller and task sets (we have a blog post describing this process at https://octopus.com/blog/ecs-canary-deployments).

I’d love to know the details of how you implemented rolling deployments if you’d be willing to share them?

Developers are not in control of the task definition so changes to environment variables become an awkward two stage process.

This is a good point. Updating image versions was an obvious choice to automate as part of a CI/CD pipeline, but it sounds like we should consider updating environment variables as well.

How can you change the load balancer as part of a deployment?

To be honest swapping load balancers wasn’t a use case we had considered. Are you able to share a situation where a new load balancer was selected during a deployment?

EDIT: I realized this might have been a miscommunication. The ability to define load balancers was going to be added to the step being delivered as part of milestone 1 (where we take full ownership of the CloudFormation template). We won't be updating load balancers with the new steps targeting the updating of task definitions and services.

For us, the points to consider are:

  • Add the ability to wait for deployment to finish.
  • Consider allowing environment vars to be updated.
  • ECS targets need to be able to assume roles.

@pete-may-bam

Will the ability to update an existing task definition and service support your existing ECS clusters?

  • Yes it will. We use terraform to create the Fargate stack and all the associated load balancing and networking. When we want to deploy a new version of the image, we rely on terraform to detect that the image version has changed in the container definition, and to only apply that change. This works but is a little scary because if terraform gets it wrong, then something will be screwed up. This is via a terraform apply step.
  • The image version is applied to our terraform via Octopus variable substitution. Your proposed mechanism of triggering a Fargate service to apply a new image version is just what we need - it will not subvert the terraform in any way.
  • With this new step we would move the terraform apply out of the main deploy process into a runbook that we execute only once, or rarely when the infrastructure changes, rather than running the apply to trigger a deploy.
  • We have found that the Fargate deploy mechanism is pretty good - it spins up new container instances and drains the previous instances before terminating them. When we do a deploy, this is all that we want, rather than terraform stepping through all the resources in the Fargate stack (38 of them) just to determine that it needs to update/replace the task definition.

What further ECS deployment challenges do you wish Octopus could solve for you?

  • It is not clear from your blog post how the image path and version is determined. Presumably it is in the Container images section of the step. We currently have a Run AWS CLI script step at the start of our Fargate deploy process. All this step does is get the image path for the package referenced in this step; the referenced package is from ECR via an Octopus external feed. We then use an output variable to expose the full image path to a subsequent step that does the terraform apply (a rough sketch of this script follows the list).
  • If the process of determining the image path could be simplified and/or consolidated into this new step that would be very helpful.
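
Roughly, that script step boils down to something like the following (the package reference name app and the Project.Ecr.RegistryHost variable are placeholders for our real names):

# Sketch of the "get the image path" script step. "app" is the package reference
# name and Project.Ecr.RegistryHost is a project variable holding the ECR
# registry hostname (both placeholders).
REPO=$(get_octopusvariable "Octopus.Action.Package[app].PackageId")
TAG=$(get_octopusvariable "Octopus.Action.Package[app].PackageVersion")
REGISTRY=$(get_octopusvariable "Project.Ecr.RegistryHost")

# Expose the full image path to the terraform apply step that follows
set_octopusvariable "FullImagePath" "$REGISTRY/$REPO:$TAG"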

Can you foresee any challenges that may stop you from using the proposed step with your existing ECS clusters?

  • Not right now.

@mcasperson
Contributor Author

mcasperson commented Aug 12, 2021

Hi @pete-may-bam,

If the process of determining the image path could be simplified and/or consolidated into this new step that would be very helpful.

We expect the Container images list to include the name of a container (this name matches what is in the task definition), and a docker image reference. The docker image version is selected at release creation time (much like the additional package reference you are using today), and the step will know how to update the task definition with the new image and version.

So the process will be:

  1. Add a container to the step
  2. Define the container name and select an associated docker image sourced from a feed
  3. Create a release and select the image version
  4. The step will take it from there by updating the task definition with the new image version, and updating the service with the new task definition revision.

Does that help explain how the deployment process would work?

@pete-may-bam

pete-may-bam commented Aug 12, 2021

@mcasperson

Does that help explain how the deployment process would work?

Yes it does. It sounds great!

@pete-may-bam

@mcasperson

Can you foresee any challenges that may stop you from using the proposed step with your existing ECS clusters?

Reading one of the other posts did make me think of something: getting environment variables into the containers. That would have to be built into the step somehow. I'm picturing a similar mechanism to the one used for the Deploy a release step template where you can specify multiple prompted variables to provide values for. In this case it would be a table of variable name / variable value pairs.

We currently solve that by iterating over a set of variables named like this: Project.EnvVar[VARIABLE_1].Value, Project.EnvVar[VARIABLE_2].Value

We have a script step that gathers those environment variable key value pairs into an environment block that is injected into the container definition via an output variable and variable substitution into the terraform script.
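
For illustration, the injection itself is essentially a jq merge along these lines (the variable values and file name are placeholders):

# Sketch: merge a block of name/value pairs into the first container definition's
# environment array. The JSON blob would be built from the Project.EnvVar[...]
# values gathered above.
ENV_JSON='{"VARIABLE_1":"value1","VARIABLE_2":"value2"}'

jq --argjson env "$ENV_JSON" \
  '.containerDefinitions[0].environment =
     ($env | to_entries | map({name: .key, value: .value}))' \
  task-definition.json > task-definition.with-env.json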

@mcasperson
Contributor Author

@pete-may-bam

Reading one of the other posts did make me think of something: getting environment variables into the containers.

I agree, we overlooked the fact that both an image version and environment variables are likely to be updated with any deployment.

The first steps being delivered as part of milestone 1 will support setting environment variables, and we'll likely also include the ability to define env vars with the step developed for milestone 2.

@blytheaw

Will the ability to update an existing task definition and service support your existing ECS clusters?

Not quite. From what I can tell, it doesn't allow me to make non-image related changes to the task definition with Terraform without things getting out of sync. Terraform won't be aware of the new revisions Octopus is creating, so if I update CPU resources in the Task Definition within Terraform, for example, it could revert me back to a previous image as the new revisions Octopus has made aren't accounted for by Terraform. Currently, to solve for this, our Terraform configuration maintains a "template" task definition that is never directly used by the ECS Service. Our Octopus deploy script step then copies the latest version of the template task definition as a new revision on the actual task definition that is used by the service. This still isn't ideal, as we don't get the changes made in Terraform until Octopus next deploys the app, but it prevents Octopus and Terraform from stepping on each other's toes. Most of this is just a result of the strange design of ECS task definitions, and it is hard to work around.
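
Conceptually, the copy looks something like this (the family and image names are placeholders):

# Sketch: take the latest revision of the Terraform-managed "template" family,
# swap in the new image, and register it as a new revision of the family the
# service actually uses.
TEMPLATE_FAMILY="myapp-template"
SERVICE_FAMILY="myapp"
NEW_IMAGE="myregistry/myapp:2.0.0"

aws ecs describe-task-definition --task-definition "$TEMPLATE_FAMILY" \
  --query 'taskDefinition' |
jq --arg family "$SERVICE_FAMILY" --arg image "$NEW_IMAGE" \
  'del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)
   | .family = $family
   | .containerDefinitions[0].image = $image' > new-revision.json

aws ecs register-task-definition --cli-input-json file://new-revision.json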

What further ECS deployment challenges do you wish Octopus could solve for you?

  • As others have mentioned in this thread, a good way to change/inject environment variables into the deployed task definitions would be valuable. Right now we are loading these from an external system at application runtime which isn't preferred.
  • Good rollback / failure handling. As mentioned above, it would be good to wait for the deployment to be successful. In the event that it isn't, what is the expected behavior? Rollback to the previous task definition revision automatically? Time out and require the user to manually stop the deployment attempts within ECS? One tricky part with copying the latest task definition is if something in that revision causes the deployment to fail, future deployments will fail (even the prior release in Octopus in this case, because it is just copying the latest revision). Our current process doesn't completely solve for this, and it doesn't happen often, but it is another awkward part of the way ECS works. Most commonly bad/missing environment variables are the cause of this. If these are handled in Octopus and are snapshots in the release, this should help mitigate these scenarios. But there are other cases that could cause it, like mis-configured IAM roles or firelens configurations.

Can you foresee any challenges that may stop you from using the proposed step with your existing ECS clusters?

  • As mentioned previously, I need the ability to actually make changes to the Task Definition (excluding the image) outside of Octopus (Terraform specifically). I want to manage things like cpu/memory allocation within Terraform that can be updated and changed outside of the application deployment lifecycle. We probably wouldn't use this step, as we would have to completely change everything about how we manage ECS services and task definitions from Terraform in order to fit into the assumptions of this new proposed step.

@mcasperson
Contributor Author

mcasperson commented Aug 16, 2021

@blytheaw

From what I can tell, it doesn't allow me to make non-image related changes to the task definition with Terraform without things getting out of sync.

That is a good point. Digging into the Terraform provider, it looks like container_definitions are exposed as a JSON blob rather than discrete Terraform values. This means the container_definitions field can ignore external changes, but will ignore all changes rather than a select subset like container_definitions.image.

This has been discussed in a long running thread (which I see you have contributed to). One of the more recent posts implements a similar solution to yours with a "template" task definition. Overall, though, the consensus appears to be that the ECS provider doesn't quite expose the required flexibility.

As a middle ground (and this is me "thinking out loud", as I'm sure you have considered this), the example shown below creates a task definition where Terraform will create the initial container settings, and then ignore subsequent container updates. This allows Terraform to be in control of creating the initial task definition, and then updating any settings outside of container_definitions:

terraform {
  backend "s3" {
    bucket = "mattc-terraform"
    key    = "ecs/taskdefinition"
    region = "us-east-1"
  }
}

variable "image" {
  type = string
  default = "nginx:1.20"
}

resource "aws_iam_role" "ecs_task_execution_role" {
  name = "role-name"
 
  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}
resource "aws_iam_role" "ecs_task_role" {
  name = "role-name-task"
 
  assume_role_policy = <<EOF
{
 "Version": "2012-10-17",
 "Statement": [
   {
     "Action": "sts:AssumeRole",
     "Principal": {
       "Service": "ecs-tasks.amazonaws.com"
     },
     "Effect": "Allow",
     "Sid": ""
   }
 ]
}
EOF
}

resource "aws_iam_role_policy_attachment" "ecs-task-execution-role-policy-attachment" {
  role       = aws_iam_role.ecs_task_execution_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

resource "aws_iam_role_policy_attachment" "task_s3" {
  role       = aws_iam_role.ecs_task_role.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}

resource "aws_ecs_task_definition" "definition" {
  family                   = "mytaskdef"
  task_role_arn            = aws_iam_role.ecs_task_role.arn
  execution_role_arn       = aws_iam_role.ecs_task_execution_role.arn
  network_mode             = "awsvpc"
  cpu                      = "256"
  memory                   = "1024"
  requires_compatibilities = ["FARGATE"]
  container_definitions    = <<DEFINITION
  [
    {
      "image": var.image,
      "name": "nginx"
    }
  ]
  DEFINITION

  lifecycle {
    ignore_changes = [
      # Ignore updates to container definitions, 
      # as these may be updated outside of Terraform
      container_definitions,
    ]
  }
}

Of all the solutions proposed in that thread (the example above, modifying state directly, using external data, and using a template), it seems like the template approach is probably the most reliable.

This is likely easy to incorporate as an option into milestone 2. We could display an option to create a new task definition revision based on the latest one, or create a new task definition revision based on a second task definition. The logic behind the scenes is going to be much the same.

@egorpavlikhin Do you have any thoughts on adding this feature to milestone 2? I'm thinking something like this:

[screenshots of the proposed step options]

As mentioned above, it would be good to wait for the deployment to be successful. In the event that it isn't, what is the expected behavior? Rollback to the previous task definition revision automatically? Time out and require the user to manually stop the deployment attempts within ECS?

This has been discussed in the team recently, and we proposed adding the following options to the ECS deployment:

  1. Never time out while waiting for a deployment (i.e. make the changes and wait indefinitely for them to succeed).
  2. Don't wait for the deployment (i.e. make the changes and leave ECS to apply them in the background).
  3. Wait for a maximum amount of time, and on timeout have the step either proceed successfully or fail (a rough sketch of this option follows).
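
For option 3, a rough sketch of the underlying wait-and-timeout logic might look like this (the timeout, cluster and service names are illustrative):

# Sketch: wait for the service to settle on a single deployment with all desired
# tasks running, failing the step after a configurable timeout.
TIMEOUT=600; INTERVAL=15; ELAPSED=0
while true; do
  SERVICE=$(aws ecs describe-services --cluster my-cluster --services my-service \
    --query 'services[0]' --output json)
  DEPLOYMENTS=$(echo "$SERVICE" | jq '.deployments | length')
  RUNNING=$(echo "$SERVICE" | jq '.runningCount')
  DESIRED=$(echo "$SERVICE" | jq '.desiredCount')
  if [ "$DEPLOYMENTS" -eq 1 ] && [ "$RUNNING" -eq "$DESIRED" ]; then
    echo "Deployment reached a steady state"
    break
  fi
  if [ "$ELAPSED" -ge "$TIMEOUT" ]; then
    echo "Timed out waiting for the deployment to stabilize" >&2
    exit 1 # lets Octopus fail the step and run any remediation steps
  fi
  sleep "$INTERVAL"
  ELAPSED=$((ELAPSED + INTERVAL))
done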

I imagine a roll back scenario would use option 3, where the step would fail after a certain time. We wouldn't be prescriptive about the solution in this scenario, but some examples could be a subsequent step in Octopus executed on failure to revert the service to the previous task definition, or an alert could be sent to Slack or email for manual resolution.

If the step could time out and Octopus could run subsequent steps to remediate the deployment, would that allow you to implement the kind of rollback you had in mind?

@blytheaw

@mcasperson

As a middle ground (and this is me "thinking out loud", as I'm sure you have considered this), the example shown below creates a task definition where Terraform will create the initial container settings, and then ignore subsequent container updates. This allows Terraform to be in control of creating the initial task definition, and then updating any settings outside of container_definitions:

Unfortunately the problem is at a higher level than just the container_definitions attribute. Because the container_definitions are ultimately just a subset of the task definition revision, and the fact that task definition revisions are immutable, an aws_ecs_service resource in Terraform points to a specific revision of a task definition in its state. So even if Terraform is ignoring changes to the container_definitions on the aws_ecs_task_definition resource, when Octopus deploys (thus creating a new revision and updating the service with that revision), the next Terraform run will see that the service is using a different revision than it knows about in its state. This is why task_definition on the aws_ecs_service has to be ignored entirely for them to play nice.

This is likely easy to incorporate as an option into milestone 2. We could display an option to create a new task definition revision based on the latest one, or create a new task definition revision based on a second task definition. The logic behind the scenes is going to be much the same.

I think this would suit our needs. As far as I can see, this template copying approach is the only way to fully manage the task definition with Terraform but also update the image (and ideally env variables) on each deployment. If the proposed step supports this, I believe we would be able to adopt it.

If the step could time out and Octopus could run subsequent steps to remediate the deployment, would that allow you to implement the kind of rollback you had in mind?

Yeah, this would be an improvement over our current solution's behavior.

Thanks!

@mcasperson
Contributor Author

Thanks all for the replies so far. Here is a summary:

  • Continuous deployment in ECS means updating the image tag and environment variables.
  • ECS targets will need to be able to assume IAM roles.
  • There is a need to monitor the state of a deployment and allow action to be taken if services fail to deploy (although the particular action to be taken is delegated to subsequent steps).
  • Allowing a task definition revision to be created from a "template" task definition allows IaC to retain complete control of a task definition, while allowing a CD tool like Octopus to manage routine app deployments.

@PeteGoo

PeteGoo commented Aug 26, 2021

Sorry, I got distracted. Here are the follow-ups to my original post about our solution.

If the original Terraform template was set to ignore changes to task definition image versions and service task definition revisions, would the proposed step allow you to create your infrastructure with Terraform and update it with Octopus?

Pretty close. I think the env vars are important, as is being able to add new sidecars etc.

I’d love to know the details of how you implemented rolling deployments if you’d be willing to share them?

Rolling the ECS cluster was relatively straightforward. We created a PowerShell script to do the following (a bash sketch of the drain/terminate portion is included after the list):

  • Assume the relevant role (cross account etc)
  • Update the Launch Template
  • Suspend Autoscaling processes on the ASG
  • Take a note of all of the existing instance ids
  • Double the cluster size and wait for the new nodes to join the ECS cluster
  • Set the old nodes that we noted earlier to draining in ECS
  • Wait (or timeout) until there are no tasks remaining on those draining nodes
  • Make sure the desired count has been met for all services (covers 0/1 single instance services)
  • Use the ASG API to remove the drained nodes from the ASG and decrement the desired count
  • Resume the Autoscaling processes on the ASG (finally block)
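
A rough bash sketch of the drain/terminate portion (the cluster and ASG names are placeholders; the real script is PowerShell and also covers the role assumption, launch template update and scale-out steps listed above):

# Sketch of the drain/terminate portion only.
CLUSTER="my-cluster"
ASG="my-cluster-asg"

aws autoscaling suspend-processes --auto-scaling-group-name "$ASG"

# Note the container instances that exist before the scale-out
OLD_INSTANCES=$(aws ecs list-container-instances --cluster "$CLUSTER" \
  --query 'containerInstanceArns' --output text)

# ... double the ASG desired capacity and wait for the new nodes to join ...

# Drain the old nodes so ECS reschedules their tasks onto the new ones
aws ecs update-container-instances-state --cluster "$CLUSTER" \
  --container-instances $OLD_INSTANCES --status DRAINING

# Wait until no tasks remain on the draining nodes (timeout omitted for brevity)
while true; do
  REMAINING=$(aws ecs describe-container-instances --cluster "$CLUSTER" \
    --container-instances $OLD_INSTANCES --output json |
    jq '[.containerInstances[].runningTasksCount] | add')
  [ "$REMAINING" -eq 0 ] && break
  sleep 15
done

# Remove the drained EC2 instances and shrink the ASG back to its original size
for EC2_ID in $(aws ecs describe-container-instances --cluster "$CLUSTER" \
    --container-instances $OLD_INSTANCES \
    --query 'containerInstances[].ec2InstanceId' --output text); do
  aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id "$EC2_ID" --should-decrement-desired-capacity
done

aws autoscaling resume-processes --auto-scaling-group-name "$ASG"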

We have a minimum healthy percentage of 100% and a maximum healthy percentage of e.g. 200% for replica services. Along with waiting for ECS deployments to fully complete, you can now set an expectation in your containers that we verify as much as possible on container start, e.g. config. That way you fail early and deployments are reliable.

  • If an ECS deployment fails to start all the new containers, our deployment has been waiting and sees this, times out and goes into guided failure and notifies the team. Meanwhile, no traffic was sent to the new containers and no users are affected.
  • If a cluster roll cannot reschedule the containers onto the new (non-draining) nodes then the deployment will timeout and notify the team.

....load balancers.....

TBH this is a pain for us. Because the service has been immutable, we couldn't change the target groups etc., and we are unable to easily do this.

@Hawxy

Hawxy commented Nov 1, 2021

I see a lot of responses regarding Terraform, but what about those of us using CDK? CDK has no ability to ignore changes to a task definition.

In our case we're using the ApplicationLoadBalancedFargateService pattern, which is what most new AWS users will be recommended if their business involves an AWS Solution Architect.

Looking at my options, it seems like passing in the image version as an environment variable and doing a force deployment might be the best option for us, ignoring this milestone entirely.

@rhysparry

@Hawxy

I see a lot of responses regarding Terraform, but what about those of us using CDK? CDK has no ability to ignore changes to a task definition.

Indeed. The underlying CloudFormation template generated by the CDK will interfere when we make out-of-band changes to the underlying service (to point to the new task definition).

We're currently investigating to see if there might be a way around this, but I'm not optimistic.

My impression is that the template will need to be run as part of the deployment process by Octopus (and all executions of the template must be run this way). This would allow Octopus the opportunity to inspect the template and make changes as required.

Alternatively, Octopus could pass the required parameters to the generator of the template.

I'd love to hear more about your use of CDK.

@rhysparry

The "Update Amazon ECS Service" step is now available in Octopus from version 2021.3.9061. It's available for download for self-hosted customers or on Octopus Cloud.

The key aspects delivered in this milestone are:

  • Updates an existing ECS service by creating a new task definition based on the previous one
  • Updates the container image reference
  • Preserves, replaces, or merges environment variables

We don't yet have support for using EC2 instance roles or assuming IAM roles when working with our ECS cluster target, but we are planning to add this support in the coming months.

Documentation for the step can be found here: https://octopus.com/docs/deployments/aws/ecs-update-service

If you'd like to chat about the step or how you use ECS, I have a Calendly page where you can book some time in my calendar: https://calendly.com/rhysparry

Alternatively, feel free to comment on this issue. I will close it out next week.

@Hawxy

Hawxy commented Feb 7, 2022

Just to provide an update in case any CDK users stumble upon this thread: we ended up implementing our CDK deployments in Octo back in December, and it went well.

Our GitHub Actions pipeline builds/versions the image, pushes it to GitHub Packages, then we package up CDK and push it to Octopus. Once that's complete, an Octopus release is created with the relevant version number, and CDK is extracted into our worker image.

Within CDK we added a contextual variable that allows the image version to be passed in via command line:

    const containerVersion = this.node.tryGetContext('containerVersion') as string | undefined
    if(!containerVersion) {
        throw Error("Container version must be declared to deploy this stack. Set it with --context containerVersion=x.x.x")
    } 

CDK kicks off and handles the complete task definition update + deployment of the image.
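
For completeness, the Octopus script step then just passes that context value through, something like this (the package reference name app is a placeholder):

# Sketch: pass the release's image version into CDK via the containerVersion
# context value. "app" is a placeholder package reference name.
CONTAINER_VERSION=$(get_octopusvariable "Octopus.Action.Package[app].PackageVersion")

cdk deploy --require-approval never --context containerVersion="$CONTAINER_VERSION"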

@rhysparry

Thanks for the info @Hawxy, I'm glad to hear you found a solution. I've recently been learning about the CDK and used environment variables to achieve the desired result. My script step essentially looked like:

# Resolve the image repository (package ID) and tag (version) from the "tentacle" package reference
export TENTACLE_REPOSITORY=$(get_octopusvariable "Octopus.Action.Package[tentacle].PackageId")
export TENTACLE_VERSION=$(get_octopusvariable "Octopus.Action.Package[tentacle].PackageVersion")

# Deploy non-interactively, streaming progress events to the task log
cdk deploy --require-approval never --progress events --no-color 2>&1

I used a package reference to specify the package, making the version (tag) configurable when creating the release.
