Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filtering out instances that are terminating #37

Closed
HeyBillFinn opened this issue Apr 2, 2020 · 5 comments
Closed

filtering out instances that are terminating #37

HeyBillFinn opened this issue Apr 2, 2020 · 5 comments
Labels
proposal An issue that proposes a feature request

Comments

@HeyBillFinn
Copy link

HeyBillFinn commented Apr 2, 2020

Is your feature request related to a problem? Please describe.

Very similar to #16 (but in the opposite direction), how can we ensure that instances that are in the process of terminating (or very recently terminated) do not get served any requests? I'd like to solve the race condition described by the diagram below.

NGINX            Instance A              AWS
  |  health check   |                 |
  ------------------>                 |
  |                 |                 |
  |       ok        |                 |
  <------------------                 |
  |                 | scale in        |
  |                 <------------------
  |                 | *transition to  |
  |                 | "Terminating"   |
  |                 |                 |
  |  HTTP request   |                 |
  ------------------>                 |
  |                 |                 |
  |    times out    |                 |
X <------------------                 |
  |                 |                 |
  |                 |                 |
  |  health check   |                 |
  ------------------>                 |
  |                 |                 |
  |      bad        |                 |
  <------------------                 |
  | *mark A                           |
  | as unhealthy                      |
  |                                   |
  |                                   |
  | get current ASG state             |
  ------------------------------------>
  |                                   |
  |         instance A has terminated |
  <-----------------------------------|
  | *remove A from                    |
  | ASG upstream group                |

Describe the solution you'd like

I'd like GetPrivateIPsForScalingGroup to optionally filter out instances (controlled by a remove_terminating config flag) that are in a state of Terminating, Terminating:Wait, Terminating:Proceed, or Terminated per this life cycle chart:

image

Source: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html

In the API, I believe these map to some combination of instance states of shutting-down, terminated, stopping, and stopped. Unfortunately, these life cycle states do not cleanly map to the instance states returned by describe-instances.

Describe alternatives you've considered

Custom script on our deploy instances that remove themselves from the NGINX upstream group via the NGINX API.

@pleshakov pleshakov added the proposal An issue that proposes a feature request label Apr 2, 2020
@pleshakov
Copy link
Contributor

Hi @HeyBillFinn

Thanks for the detailed proposal.

I think we can filter out non-InService instances.

However, I can still see a potential problem during termination. Based on your diagram, we have a time period between the instance start *transition to "Terminating" and get current ASG (which is <= the sync_interval_in_seconds (default is 5s)). So even though we can filter out non-inServcie instances in get current ASG state, up until that point, NGINX Plus will continue to have the terminating instance in its upstream.

I wonder if you think this could cause problems? Do your instances terminate immediately or go into Terminating:Wait state?

@HeyBillFinn
Copy link
Author

I wonder if you think this could cause problems? Do your instances terminate immediately or go into Terminating:Wait state?

Users can set a EC2_INSTANCE_TERMINATING lifecycle hook with a heartbeat timeout that is greater than the NGINX poll interval. That will cause the instance to transition to Terminating:Wait (while it's still successfully servicing requests), nginx-asg-sync sees the instance is no longer InService & deletes it from the upstream group, and, finally, heartbeat timeout expires & AWS terminates the instance. (We haven't implemented this approach yet, but this is our plan.)

For what it's worth, I now realize that the response value of EC2's DescribeInstances is not sufficient to filter out terminating instances because the instance state returned is always "running" even when the instance is in a Terminating:Wait state. Apparently, the auto-scaling instance state is different from the vanilla instance state and doesn't intuitively map (e.g., one might expect Terminating:Wait to map to "shutting-down" or "stopping").

So, I re-added the autoscalingiface that was removed in #29 to use it for querying auto-scaling instances via DescribeAutoScalingInstances. I would prefer to gather all of the instance IDs and make a single AWS request to DescribeAutoScalingInstances, but my golang skills are lacking.

I tested this patch, and it fixes the problem I described in the original issue.

diff --git a/cmd/sync/aws.go b/cmd/sync/aws.go
index bd6ba05..d0a7bab 100644
--- a/cmd/sync/aws.go
+++ b/cmd/sync/aws.go
@@ -8,6 +8,8 @@ import (
 	"github.com/aws/aws-sdk-go/aws"
 	"github.com/aws/aws-sdk-go/aws/ec2metadata"
 	"github.com/aws/aws-sdk-go/aws/session"
+	"github.com/aws/aws-sdk-go/service/autoscaling"
+	"github.com/aws/aws-sdk-go/service/autoscaling/autoscalingiface"
 	"github.com/aws/aws-sdk-go/service/ec2"
 	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
 	yaml "gopkg.in/yaml.v2"
@@ -15,8 +17,9 @@ import (
 
 // AWSClient allows you to get the list of IP addresses of instanes of an Auto Scaling group. It implements the CloudProvider interface
 type AWSClient struct {
-	svcEC2 ec2iface.EC2API
-	config *awsConfig
+	svcEC2         ec2iface.EC2API
+	svcAutoscaling autoscalingiface.AutoScalingAPI
+	config         *awsConfig
 }
 
 // NewAWSClient creates and configures an AWSClient
@@ -87,8 +90,10 @@ func (client *AWSClient) configure() error {
 		return err
 	}
 
+	svcAutoscaling := autoscaling.New(session)
 	svcEC2 := ec2.New(session)
 	client.svcEC2 = svcEC2
+	client.svcAutoscaling = svcAutoscaling
 	return nil
 }
 
@@ -155,7 +160,18 @@ func (client *AWSClient) GetPrivateIPsForScalingGroup(name string) ([]string, er
 	for _, res := range response.Reservations {
 		for _, ins := range res.Instances {
 			if len(ins.NetworkInterfaces) > 0 && ins.NetworkInterfaces[0].PrivateIpAddress != nil {
-				result = append(result, *ins.NetworkInterfaces[0].PrivateIpAddress)
+				asg_params := &autoscaling.DescribeAutoScalingInstancesInput{
+					InstanceIds: []*string{
+						aws.String(*ins.InstanceId),
+					},
+				}
+				asg_response, asg_err := client.svcAutoscaling.DescribeAutoScalingInstances(asg_params)
+				if asg_err != nil {
+					return nil, asg_err
+				}
+				if len(asg_response.AutoScalingInstances) > 0 && *asg_response.AutoScalingInstances[0].LifecycleState == "InService" {
+					result = append(result, *ins.NetworkInterfaces[0].PrivateIpAddress)
+				}
 			}
 		}
 	}
diff --git a/cmd/sync/main.go b/cmd/sync/main.go
index 25aa50a..b3aaad3 100644
--- a/cmd/sync/main.go
+++ b/cmd/sync/main.go
@@ -17,7 +17,7 @@ import (
 
 var configFile = flag.String("config_path", "/etc/nginx/config.yaml", "Path to the config file")
 var logFile = flag.String("log_path", "", "Path to the log file. If the file doesn't exist, it will be created")
-var version = "0.4-1"
+var version = "0.4-1-debug"
 
 const connTimeoutInSecs = 10

@HeyBillFinn
Copy link
Author

Quick follow-up here:

It's possible to hit a throttling rate limit with the patch above.

2020/04/03 19:27:13 Couldn't get the IP addresses for <group name>: Throttling: Rate exceeded

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

I hit the rate limit with 2 groups (same name prefix and used a wildcard to pick up both groups), each having 30 instances.

@pleshakov
Copy link
Contributor

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

This seems like a right approach. However, I'm slightly confused by what you mean by"space-separated instance IDs".

If you're interested in sending a PR, we'd be happy to review and help!

@HeyBillFinn
Copy link
Author

So, we'll be looking to consolidate the DescribeAutoScalingInstances into a single HTTP request with space-separated instance IDs per polling period.

This seems like a right approach. However, I'm slightly confused by what you mean by"space-separated instance IDs".

Ugh, that was my talking about how the aws autoscaling CLI arguments are passed, rather than how the AWS Go SDK.

It looks like DescribeAutoScalingInstances has a max of 50 instance IDs per request. Otherwise, we see errors like:

ValidationError: The number of instance ids that may be passed in is limited to 50

HeyBillFinn added a commit to angaza/nginx-asg-sync that referenced this issue Apr 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
proposal An issue that proposes a feature request
Projects
None yet
Development

No branches or pull requests

3 participants