server-acl-init-cleanup returns error of [dial tcp 172.20.0.1:443: connect: connection refused] #1376

Closed
shixuyue opened this issue Jul 27, 2022 · 4 comments
Labels
type/question Question about product, ideally should be pointed to discuss.hashicorp.com

Comments

@shixuyue

shixuyue commented Jul 27, 2022

Question

The server-acl-init-cleanup job fails with a "connect: connection refused" error:

2022-07-27T06:12:33.892Z [INFO]  waiting for job "consul-server-acl-init" to complete successfully
Error getting job "consul-resource-manager-server-acl-init": Get "https://172.20.0.1:443/apis/batch/v1/namespaces/<myNS>/jobs/consul-server-acl-init": dial tcp 172.20.0.1:443: connect: connection refused

The init job itself, however, completes successfully:

2022-07-27T06:07:08.848Z [ERROR] Failure: creating agent policy - PUT /v1/acl/policy: err="Put "http://consul-server-0.consul-server.<ns>.svc:8500/v1/acl/policy": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
2022-07-27T06:07:08.848Z [INFO]  Retrying in 1s
2022-07-27T06:07:10.637Z [INFO]  Success: creating agent policy - PUT /v1/acl/policy
2022-07-27T06:07:10.649Z [INFO]  Success: creating server token for consul-server-0.consul-server.<ns>.svc - PUT /v1/acl/token
2022-07-27T06:07:10.654Z [INFO]  Success: updating server token for consul-server-0.consul-server.<ns>.svc - PUT /v1/agent/token/agent
2022-07-27T06:07:10.672Z [INFO]  Success: calling /agent/self to get datacenter
2022-07-27T06:07:10.672Z [INFO]  Current datacenter: datacenter=<dc> primaryDC=<dc>
2022-07-27T06:07:10.690Z [INFO]  Success: getting consul-auth-method ServiceAccount
2022-07-27T06:07:10.693Z [INFO]  Success: getting consul-auth-method-token-fv7vt Secret
2022-07-27T06:07:10.702Z [INFO]  Success: creating auth method consul-k8s-component-auth-method
2022-07-27T06:07:10.702Z [INFO]  server-acl-init completed successfully

CLI Commands (consul-k8s, consul-k8s-control-plane, helm)

delete-completed-job

Helm Configuration

global:
    acls:
        manageSystemACLs: true
        bootstrapToken:
            secretName: <consul-name>-master-token
            secretKey: token
    name: <consul-name>
    datacenter: <dc>
    domain: <consul-name>
dns:
    enabled: false
ui:
    enabled: true
client:
    enabled: false
server:
    replicas: 1
    priorityClassName: infrastructure-apps
    service:
        annotations: |
            "consul.hashicorp.com/service-ignore": "true"
    enabled: true
    storageClass: ebs-gp3
    resources:
        limits:
            memory: "2Gi"
            cpu: "500m"
        requests:
            memory: "500Mi"
            cpu: "200m"

Logs

Shown above. Let me know if you need more information.

Current understanding and Expected behavior

The cleanup job should remove the init job so that Helm knows the install completed successfully. However, because the cleanup job errors out, Helm never receives a success signal and eventually times out.

Environment details

The image I am using: hashicorp/consul-k8s-control-plane:0.46.0
The helm command I am using: helm upgrade --install --create-namespace --namespace <ns> consul hashicorp/consul

Additional Context

I suspect that the post-install container starts running before the istio-proxy container has finished starting up, and the code here doesn't seem to have any retry logic to handle a situation like this:

if err != nil {
    c.UI.Error(fmt.Sprintf("Error getting job %q: %s", jobName, err))
    return 1
}

It returns exit code 1 immediately.

Also, <ns> is an Istio-injected namespace.
I only use the Consul KV feature at the moment, so the client is not required and one server is sufficient.

@shixuyue
Author

I can more or less confirm that my guess is correct: I created a new namespace without istio-proxy injection, and everything works fine.
To fix this, we can either add retry logic instead of immediately returning 1, or pass an annotation through values.yaml to these jobs so that istio-proxy is not injected, which is not currently supported.
Can someone take a look? I am happy to take this task.

@shixuyue
Author

I can work around this by updating the Istio values: setting values.global.proxy.holdApplicationUntilProxyStarts=true is enough.
However, the server-acl-init job will not stop its istio-proxy sidecar container, so the cleanup job will never remove the init job.
But this is an Istio problem rather than a Consul one.
I still feel it's better to have an additional annotations section in values.yaml, so that in the future people can choose not to have istio-proxy injected.

@lkysow
Member

lkysow commented Aug 18, 2022

Hi @shixuyue, yes, that sounds like what could be happening. We'd be happy to review a PR that adds annotation support for the ACL cleanup job!

@david-yu
Contributor

Closing as the acl-init annotation is now implemented through this PR: #2525
