filtering out instances that are terminating #37

HeyBillFinn · 2020-04-02T21:55:52Z

Is your feature request related to a problem? Please describe.

Very similar to #16 (but in the opposite direction), how can we ensure that instances that are in the process of terminating (or very recently terminated) do not get served any requests? I'd like to solve the race condition described by the diagram below.

NGINX            Instance A              AWS
  |  health check   |                 |
  ------------------>                 |
  |                 |                 |
  |       ok        |                 |
  <------------------                 |
  |                 | scale in        |
  |                 <------------------
  |                 | *transition to  |
  |                 | "Terminating"   |
  |                 |                 |
  |  HTTP request   |                 |
  ------------------>                 |
  |                 |                 |
  |    times out    |                 |
X <------------------                 |
  |                 |                 |
  |                 |                 |
  |  health check   |                 |
  ------------------>                 |
  |                 |                 |
  |      bad        |                 |
  <------------------                 |
  | *mark A                           |
  | as unhealthy                      |
  |                                   |
  |                                   |
  | get current ASG state             |
  ------------------------------------>
  |                                   |
  |         instance A has terminated |
  <-----------------------------------|
  | *remove A from                    |
  | ASG upstream group                |

Describe the solution you'd like

I'd like GetPrivateIPsForScalingGroup to optionally filter out instances (controlled by a remove_terminating config flag) that are in a state of Terminating, Terminating:Wait, Terminating:Proceed, or Terminated per this life cycle chart:

Source: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

~~In the API, I believe these map to some combination of instance states of shutting-down, terminated, stopping, and stopped.~~ Unfortunately, these life cycle states do not cleanly map to the instance states returned by describe-instances.

Describe alternatives you've considered

Custom script on our deploy instances that remove themselves from the NGINX upstream group via the NGINX API.

The text was updated successfully, but these errors were encountered:

pleshakov · 2020-04-03T19:11:08Z

Hi @HeyBillFinn

Thanks for the detailed proposal.

I think we can filter out non-InService instances.

However, I can still see a potential problem during termination. Based on your diagram, we have a time period between the instance start *transition to "Terminating" and get current ASG (which is <= the sync_interval_in_seconds (default is 5s)). So even though we can filter out non-inServcie instances in get current ASG state, up until that point, NGINX Plus will continue to have the terminating instance in its upstream.

I wonder if you think this could cause problems? Do your instances terminate immediately or go into Terminating:Wait state?

HeyBillFinn · 2020-04-03T19:36:03Z

I wonder if you think this could cause problems? Do your instances terminate immediately or go into Terminating:Wait state?

Users can set a EC2_INSTANCE_TERMINATING lifecycle hook with a heartbeat timeout that is greater than the NGINX poll interval. That will cause the instance to transition to Terminating:Wait (while it's still successfully servicing requests), nginx-asg-sync sees the instance is no longer InService & deletes it from the upstream group, and, finally, heartbeat timeout expires & AWS terminates the instance. (We haven't implemented this approach yet, but this is our plan.)

For what it's worth, I now realize that the response value of EC2's DescribeInstances is not sufficient to filter out terminating instances because the instance state returned is always "running" even when the instance is in a Terminating:Wait state. Apparently, the auto-scaling instance state is different from the vanilla instance state and doesn't intuitively map (e.g., one might expect Terminating:Wait to map to "shutting-down" or "stopping").

So, I re-added the autoscalingiface that was removed in #29 to use it for querying auto-scaling instances via DescribeAutoScalingInstances. I would prefer to gather all of the instance IDs and make a single AWS request to DescribeAutoScalingInstances, but my golang skills are lacking.

I tested this patch, and it fixes the problem I described in the original issue.

diff --git a/cmd/sync/aws.go b/cmd/sync/aws.go
index bd6ba05..d0a7bab 100644
--- a/cmd/sync/aws.go
+++ b/cmd/sync/aws.go
@@ -8,6 +8,8 @@ import (
 	"github.com/aws/aws-sdk-go/aws"
 	"github.com/aws/aws-sdk-go/aws/ec2metadata"
 	"github.com/aws/aws-sdk-go/aws/session"
+	"github.com/aws/aws-sdk-go/service/autoscaling"
+	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
 	"github.com/aws/aws-sdk-go/service/ec2"
 	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
 	yaml "gopkg.in/yaml.v2"
@@ -15,8 +17,9 @@ import (
 
 // AWSClient allows you to get the list of IP addresses of instanes of an Auto Scaling group. It implements the CloudProvider interface
 type AWSClient struct {
-	svcEC2 ec2iface.EC2API
-	config *awsConfig
+	svcEC2         ec2iface.EC2API
+	svcAutoscaling autoscalingiface.AutoScalingAPI
+	config         *awsConfig
 }
 
 // NewAWSClient creates and configures an AWSClient
@@ -87,8 +90,10 @@ func (client *AWSClient) configure() error {
 		return err
 	}
 
+	svcAutoscaling := autoscaling.New(session)
 	svcEC2 := ec2.New(session)
 	client.svcEC2 = svcEC2
+	client.svcAutoscaling = svcAutoscaling
 	return nil
 }
 
@@ -155,7 +160,18 @@ func (client *AWSClient) GetPrivateIPsForScalingGroup(name string) ([]string, er
 	for _, res := range response.Reservations {
 		for _, ins := range res.Instances {
 			if len(ins.NetworkInterfaces) > 0 && ins.NetworkInterfaces[0].PrivateIpAddress != nil {
-				result = append(result, *ins.NetworkInterfaces[0].PrivateIpAddress)
+				asg_params := &autoscaling.DescribeAutoScalingInstancesInput{
+					InstanceIds: []*string{
+						aws.String(*ins.InstanceId),
+					},
+				}
+				asg_response, asg_err := client.svcAutoscaling.DescribeAutoScalingInstances(asg_params)
+				if asg_err != nil {
+					return nil, asg_err
+				}
+				if len(asg_response.AutoScalingInstances) > 0 && *asg_response.AutoScalingInstances[0].LifecycleState == "InService" {
+					result = append(result, *ins.NetworkInterfaces[0].PrivateIpAddress)
+				}
 			}
 		}
 	}
diff --git a/cmd/sync/main.go b/cmd/sync/main.go
index 25aa50a..b3aaad3 100644
--- a/cmd/sync/main.go
+++ b/cmd/sync/main.go
@@ -17,7 +17,7 @@ import (
 
 var configFile = flag.String("config_path", "/etc/nginx/config.yaml", "Path to the config file")
 var logFile = flag.String("log_path", "", "Path to the log file. If the file doesn't exist, it will be created")
-var version = "0.4-1"
+var version = "0.4-1-debug"
 
 const connTimeoutInSecs = 10

HeyBillFinn · 2020-04-03T20:08:19Z

Quick follow-up here:

It's possible to hit a throttling rate limit with the patch above.

2020/04/03 19:27:13 Couldn't get the IP addresses for <group name>: Throttling: Rate exceeded

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

I hit the rate limit with 2 groups (same name prefix and used a wildcard to pick up both groups), each having 30 instances.

pleshakov · 2020-04-03T21:42:48Z

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

This seems like a right approach. However, I'm slightly confused by what you mean by"space-separated instance IDs".

If you're interested in sending a PR, we'd be happy to review and help!

HeyBillFinn · 2020-04-03T23:39:42Z

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

This seems like a right approach. However, I'm slightly confused by what you mean by"space-separated instance IDs".

Ugh, that was my talking about how the aws autoscaling CLI arguments are passed, rather than how the AWS Go SDK.

It looks like DescribeAutoScalingInstances has a max of 50 instance IDs per request. Otherwise, we see errors like:

ValidationError: The number of instance ids that may be passed in is limited to 50

Proposed fix for race condition described in nginxinc#37.

pleshakov added the proposal An issue that proposes a feature request label Apr 2, 2020

HeyBillFinn added a commit to angaza/nginx-asg-sync that referenced this issue Apr 4, 2020

aws.GetPrivateIPsForScalingGroup: Only return InService instances

2cbf87c

Proposed fix for race condition described in nginxinc#37.

HeyBillFinn mentioned this issue Apr 4, 2020

aws.GetPrivateIPsForScalingGroup: Only return InService instances #38

Closed

6 tasks

pleshakov mentioned this issue May 21, 2020

Add InService option for AWS #39

Merged

Dean-Coakley closed this as completed Oct 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

filtering out instances that are terminating #37

filtering out instances that are terminating #37

HeyBillFinn commented Apr 2, 2020 •

edited

Loading

pleshakov commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

pleshakov commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

filtering out instances that are terminating #37

filtering out instances that are terminating #37

Comments

HeyBillFinn commented Apr 2, 2020 • edited Loading

pleshakov commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

pleshakov commented Apr 3, 2020

HeyBillFinn commented Apr 3, 2020

HeyBillFinn commented Apr 2, 2020 •

edited

Loading