fix(controller): Sticky session correction for AWS ALB. Fixes #1572 #1577

Merged
merged 9 commits into from
Dec 14, 2021

Conversation

Contributor

@derjust derjust commented Oct 12, 2021

When sticky sessions are activated on the Ingress for AWS ALB:

Annotations:  
              alb.ingress.kubernetes.io/target-group-attributes: stickiness.enabled=true,stickiness.lb_cookie.duration_seconds=60
...

target group stickiness must also be configured on the rule.
Otherwise, the error

InvalidLoadBalancerAction: You must enable group stickiness on a rule if you enabled target stickiness on one of its target groups

is reported by the AWS ALB.

Therefore, this PR adds support for the AWS ALB TargetGroupStickinessConfig, following the same pattern as the additionalIngressAnnotations for Istio on the Rollout:

spec:
  progressDeadlineSeconds: 1800
  replicas: 2
  strategy:
    canary:
      canaryService: argo-test-canary
      stableService: argo-test-stable
      steps:
      - setWeight: 50
      - pause: {}
      - setWeight: 100
      trafficRouting:
        alb:
          ingress: argo-test-alb
          servicePort: 80
          stickinessConfig:
            durationSeconds: 60
            enabled: true
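
To illustrate what the `stickinessConfig` above is meant to produce, here is a minimal sketch of building a weighted forward action that also carries group-level stickiness. The struct types and the `buildForwardAction` helper are hypothetical and exist only for this example; the JSON key names are an assumption based on the shape of an ALB forward action (`TargetGroups` plus `TargetGroupStickinessConfig`) and may not match the code in this PR exactly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types only (not the PR's structs). The key names assume the
// JSON shape of an ALB forward action.
type targetGroup struct {
	ServiceName string `json:"ServiceName"`
	ServicePort string `json:"ServicePort"`
	Weight      int64  `json:"Weight"`
}

type stickinessConfig struct {
	Enabled         bool  `json:"Enabled"`
	DurationSeconds int64 `json:"DurationSeconds"`
}

type forwardConfig struct {
	TargetGroups []targetGroup `json:"TargetGroups"`
	// Pointer with omitempty: the key is left out when stickiness is not requested.
	TargetGroupStickinessConfig *stickinessConfig `json:"TargetGroupStickinessConfig,omitempty"`
}

type action struct {
	Type          string        `json:"Type"`
	ForwardConfig forwardConfig `json:"ForwardConfig"`
}

// buildForwardAction (hypothetical helper) splits traffic between the canary
// and stable services and attaches group stickiness only when it is enabled.
func buildForwardAction(canarySvc, stableSvc string, port int32, weight int64, sticky *stickinessConfig) (string, error) {
	a := action{
		Type: "forward",
		ForwardConfig: forwardConfig{
			TargetGroups: []targetGroup{
				{ServiceName: canarySvc, ServicePort: fmt.Sprint(port), Weight: weight},
				{ServiceName: stableSvc, ServicePort: fmt.Sprint(port), Weight: 100 - weight},
			},
		},
	}
	if sticky != nil && sticky.Enabled {
		a.ForwardConfig.TargetGroupStickinessConfig = sticky
	}
	b, err := json.Marshal(a)
	return string(b), err
}

func main() {
	out, err := buildForwardAction("argo-test-canary", "argo-test-stable", 80, 50,
		&stickinessConfig{Enabled: true, DurationSeconds: 60})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

Passing `nil` (or a config with `Enabled: false`) yields a plain weighted forward action without the stickiness key, which keeps existing Rollouts unaffected.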

[Screenshot: 2021-10-12, 1:08:41 AM]


Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@codecov

codecov bot commented Oct 12, 2021

Codecov Report

Merging #1577 (7c0b424) into master (9d32c13) will increase coverage by 0.06%.
The diff coverage is 80.00%.

@@            Coverage Diff             @@
##           master    #1577      +/-   ##
==========================================
+ Coverage   81.97%   82.04%   +0.06%     
==========================================
  Files         116      116              
  Lines       15929    16099     +170     
==========================================
+ Hits        13058    13208     +150     
- Misses       2201     2217      +16     
- Partials      670      674       +4     
Impacted Files                        Coverage Δ
utils/annotations/annotations.go      97.29% <ø> (ø)
utils/ingress/ingress.go              100.00% <ø> (ø)
rollout/trafficrouting/alb/alb.go     82.26% <60.00%> (-2.69%) ⬇️
ingress/alb.go                        96.55% <100.00%> (+0.30%) ⬆️
rollout/temlateref.go                 82.98% <100.00%> (+2.44%) ⬆️
rollout/pause.go                      95.33% <0.00%> (ø)
rollout/restart.go                    98.64% <0.00%> (ø)
rollout/replicaset.go                 67.59% <0.00%> (ø)
analysis/controller.go                52.17% <0.00%> (ø)
utils/metric/metric.go                100.00% <0.00%> (ø)
... and 21 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@harikrongali
Contributor

linking #1572

@@ -243,6 +243,16 @@ func getForwardActionString(r *v1alpha1.Rollout, port int32, desiredWeight int32
TargetGroups: targetGroups,
},
}

var stickinessConfig = r.Spec.Strategy.Canary.TrafficRouting.ALB.StickinessConfig
if stickinessConfig.Enabled {
Contributor

Is the default false?
Also, can you add a nil check for stickinessConfig?

Contributor Author

Done
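
For readers following the thread, the nil check requested above could look roughly like the sketch below. It assumes `StickinessConfig` is an optional pointer field on the ALB traffic-routing spec; the types here are simplified stand-ins, and `stickinessEnabled` is a hypothetical helper, not a function from the PR.

```go
package main

import "fmt"

// Simplified stand-ins for the Rollout spec types; only the fields needed here.
type StickinessConfig struct {
	Enabled         bool
	DurationSeconds int64
}

type ALBTrafficRouting struct {
	// Optional: nil when stickinessConfig is omitted from the Rollout spec.
	StickinessConfig *StickinessConfig
}

// stickinessEnabled (hypothetical helper) treats a missing ALB section or a
// missing stickinessConfig as "disabled" instead of dereferencing a nil pointer.
func stickinessEnabled(alb *ALBTrafficRouting) bool {
	return alb != nil && alb.StickinessConfig != nil && alb.StickinessConfig.Enabled
}

func main() {
	fmt.Println(stickinessEnabled(nil))
	fmt.Println(stickinessEnabled(&ALBTrafficRouting{}))
	fmt.Println(stickinessEnabled(&ALBTrafficRouting{
		StickinessConfig: &StickinessConfig{Enabled: true, DurationSeconds: 60},
	}))
}
```

Because Go evaluates `&&` left to right and short-circuits, the `Enabled` field is only read once both pointers are known to be non-nil, so the default is effectively false.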

if stickinessConfig.Enabled {
newStickyConfig := ingressutil.ALBTargetGroupStickinessConfig{
Enabled: true,
DurationSeconds: stickinessConfig.DurationSeconds,
Contributor

Please add validation for stickinessConfig.DurationSeconds.

Contributor Author

@harikrongali I'm debating what kind of validation would be reasonable here.
To a certain degree, I would let the AWS API fail and rely on its validation result. What did you have in mind here? Verifying that the value is > 0? Or the same range that is currently enforced by the AWS API, which is 1-604800 seconds?

Contributor

The same range as specified by the AWS API should be good.
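
A minimal sketch of the range check agreed on above, using the 1-604800 second bounds quoted in this thread. The function name, constants, and error wording are illustrative assumptions and not taken from the PR.

```go
package main

import "fmt"

const (
	// Bounds quoted in the review discussion for the AWS API.
	minStickinessSeconds int64 = 1
	maxStickinessSeconds int64 = 604800 // 7 days
)

// validateStickinessDuration (hypothetical helper) rejects durations outside
// the inclusive range accepted by the AWS API, so the user gets an early,
// readable error instead of a failed ALB update.
func validateStickinessDuration(seconds int64) error {
	if seconds < minStickinessSeconds || seconds > maxStickinessSeconds {
		return fmt.Errorf("stickiness duration must be between %d and %d seconds, got %d",
			minStickinessSeconds, maxStickinessSeconds, seconds)
	}
	return nil
}

func main() {
	for _, d := range []int64{0, 60, 604800, 604801} {
		if err := validateStickinessDuration(d); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", d)
	}
}
```

Validating up front duplicates a rule the AWS API already enforces, but it lets the Rollout report the problem at spec-validation time rather than during reconciliation.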

@harikrongali
Contributor

Please add an e2e test.

@harikrongali
Contributor

Please fix the CI failures.

@derjust
Contributor Author

derjust commented Oct 22, 2021

Thank you for the review. Working on this.

@derjust derjust force-pushed the ALBStickyConfig branch 3 times, most recently from e69ddd8 to efdabca Compare October 28, 2021 01:19
@derjust
Contributor Author

derjust commented Nov 18, 2021

I was wondering how this PR can proceed further.

@harikrongali
Contributor

@derjust We will review this soon. In the meantime, can you rebase onto the latest changes, since the CI checks are failing?

canaryService := fmt.Sprintf("%s-canary", serviceName)
albActionKey := albActionAnnotation(serviceName)
managedBy := fmt.Sprintf("%s:%s", rollout, albActionKey)
action := fmt.Sprintf(actionTemplate, serviceName, port, canaryService, port)
var template string
Contributor

@harikrongali harikrongali Nov 18, 2021

Can we refactor to:

action := fmt.Sprintf(actionTemplate, serviceName, port, canaryService, port)
if includeStickyConfig {
	action = fmt.Sprintf(actionTemplateWithStickyConfig, serviceName, port, canaryService, port)
}

managedByValue := fmt.Sprintf("%s:%s", managedBy, albActionAnnotation(stableSvc))
action := fmt.Sprintf(actionTemplate, canarySvc, port, weight, stableSvc, port, 100-weight)
var action string
Contributor

refactor to

action := fmt.Sprintf(actionTemplate, canarySvc, port, weight, stableSvc, port, 100-weight)
if includeStickyConfig {
		action = fmt.Sprintf(actionTemplateWithStickyConfig, canarySvc, port, weight, stableSvc, port, 100-weight)
}

@harikrongali
Contributor

@derjust overall looks good. Thanks for the contribution. As soon as the comments & CI failures are fixed, I will get this merged

@harikrongali
Contributor

Please run codegen locally and commit changes generated from it.

@derjust
Contributor Author

derjust commented Nov 18, 2021

Thank you for the extensive review - it's my first Go PR so I appreciate all the effort.
Working on this now

@derjust derjust force-pushed the ALBStickyConfig branch 3 times, most recently from 97e5bb2 to 55791d4 Compare November 19, 2021 05:48
@harikrongali
Contributor

@derjust there are some unit tests failing with nil pointer exceptions.

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1993061]

can you run "make test" locally?

@harikrongali
Contributor

@@ -1,3 +1,4 @@
//go:build !ignore_autogenerated
Contributor

Codegen is failing on this line. Please remove it.

Contributor Author

Thank you; undone.
Does the codegen have an issue running on Mac?

@derjust
Contributor Author

derjust commented Nov 23, 2021

Unit tests work locally. (Note: they fail on OSX with a permission-denied error, which makes the list test fail because, I guess, the clock can't be reset.)
DCO is also fixed. Thank you for the ongoing support!

@harikrongali
Contributor

harikrongali commented Nov 23, 2021

@derjust It seems the codegen you are running has issues (https://github.com/argoproj/argo-rollouts/runs/4303557116?check_suite_focus=true). Can you let me know whether you followed https://argoproj.github.io/argo-rollouts/CONTRIBUTING/ ?
Installation:

brew install go kubectl kustomize golangci-lint protobuf swagger-codegen

make codegen - Runs the code generator that creates the informers, client, lister, and deepcopies from the types.go and modifies the open-api spec.

@derjust
Contributor Author

derjust commented Nov 26, 2021

@harikrongali thank you for your continued help.
Yes, I did, and I ran it again, too:

$ brew install go kubectl kustomize golangci-lint protobuf swagger-codegen

Warning: go 1.17.2 is already installed and up-to-date.
To reinstall 1.17.2, run:
  brew reinstall go
Warning: kubernetes-cli 1.22.4 is already installed, it's just not linked.
To link this version, run:
  brew link kubernetes-cli
Warning: kustomize 4.4.1 is already installed and up-to-date.
To reinstall 4.4.1, run:
  brew reinstall kustomize
Warning: golangci-lint 1.43.0 is already installed and up-to-date.
To reinstall 1.43.0, run:
  brew reinstall golangci-lint
Warning: protobuf 3.17.3 is already installed and up-to-date.
To reinstall 3.17.3, run:
  brew reinstall protobuf
Warning: swagger-codegen 3.0.29 is already installed and up-to-date.
To reinstall 3.0.29, run:
  brew reinstall swagger-codegen

To help it along, I forced the use of Go 1.16, but for some reason TestFunctionalSuite/TestWorkloadRef still fails :-(
This also fails on master 47d59fa for me. I doubt that it is a problem in the repo rather than in my setup, so I'm not sure what's off or whether I'm chasing white elephants here.
Here is the output from the controller at the time when the TestWorkloadRef test fails:

INFO[2021-11-26T19:03:04-05:00] rollout syncHandler queue retries: 11 : key "default/rollout-ref-deployment"  namespace=default rollout=rollout-ref-deployment
E1126 19:03:04.412350   61396 controller.go:174] deployments.apps "default/rollout-ref-deployment" not found
ERRO[2021-11-26T19:03:04-05:00] Cannot update the workload-ref/annotation for rollout-ref-deployment/default
ERRO[2021-11-26T19:03:04-05:00] Cannot update the workload-ref/annotation for rollout-ref-deployment/default
ERRO[2021-11-26T19:03:04-05:00] Cannot update the workload-ref/annotation for rollout-ref-deployment/default

@derjust
Contributor Author

derjust commented Nov 28, 2021

I investigated further and posted my findings about my codegen issue and the failing e2e test as a new issue:
#1675

Hope this can shed some light?

alexmt and others added 8 commits December 1, 2021 10:29
Adds support for AWS ALB [TargetGroupStickinessConfig](https://aws.amazon.com/blogs/aws/new-application-load-balancer-simplifies-deployment-with-weighted-target-groups/)

This is required to support sticky sessions at the listener level while Argo is using ALB's weighting.

Signed-off-by: Sebastian J <[email protected]>
Signed-off-by: Sebastian J <[email protected]>
Signed-off-by: Sebastian J <[email protected]>
Signed-off-by: Sebastian J <[email protected]>
Signed-off-by: Sebastian J <[email protected]>
Signed-off-by: Sebastian J <[email protected]>
Forced codegen via downgrading to Go 1.16:

```
$ env|grep GO

GOPATH=/Users/sebastian/go
```

```
$ go version

go version go1.16.10 darwin/amd64
```

```
$ echo $PATH

/Users/sebastian/.sdkman/candidates/micronaut/current/bin:/Users/sebastian/.sdkman/candidates/java/current/bin:/Users/sebastian/.cargo/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Library/TeX/texbin:/usr/local/MacGPG2/bin:/usr/local/share/dotnet:/Library/Frameworks/Mono.framework/Versions/Current/Commands:/bin:/Users/sebastian/go/bin
```

Signed-off-by: Sebastian J <[email protected]>
@derjust
Contributor Author

derjust commented Dec 3, 2021

A PR for the issue in #1675 was provided.
As it is already approved, I assume it will be merged to master soon, so I have already rebased this PR on it.

Signed-off-by: Sebastian J <[email protected]>
@sonarcloud

sonarcloud bot commented Dec 3, 2021

Kudos, SonarCloud Quality Gate passed!

0 Bugs (rating A)
0 Vulnerabilities (rating A)
0 Security Hotspots (rating A)
1 Code Smell (rating A)

No Coverage information
5.2% Duplication

@harikrongali
Contributor

@derjust Sorry, I was out for the last couple of weeks. I will get this merged this week.

@harikrongali
Contributor

@alexmt Can you merge the PR? The failed e2e test is a flaky one and is not related to the current change.

@alexmt alexmt merged commit 5e188f9 into argoproj:master Dec 14, 2021