Watching stops working after 10 minutes #85

Closed
PeterGerrard opened this issue Aug 18, 2020 · 44 comments · Fixed by #150

@PeterGerrard

When watching a set of ConfigMaps, the sidecar stops being notified of new changes after 10 minutes of no changes.

## Repro steps

1. Install the following ConfigMap into a cluster:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-test-dashboard
  labels:
    grafana_dashboard: "1"
data:
  cm-test.json: "{}"
```

2. Install the sidecar into the cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
    - env:
        - name: METHOD
        - name: LABEL
          value: grafana_dashboard
        - name: FOLDER
          value: /tmp/dashboards
        - name: RESOURCE
          value: both
      image: kiwigrid/k8s-sidecar:0.1.178
      imagePullPolicy: IfNotPresent
      name: grafana-sc-dashboard
```

3. Wait 10 minutes.
4. Make a change to the ConfigMap and apply it in the cluster.

## Expected Behaviour

A modification event is observed.

## Actual Behaviour

Nothing happens.


Observed on AKS with Kubernetes version 1.16.10.

@ahiaht

ahiaht commented Aug 28, 2020

I have the same issue with AKS 1.16.10 and sidecar 0.1.151.
When the sidecar starts up it runs fine, but the watch loop fails after 30 minutes with this error:
[2020-08-27 15:25:47] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@axdotl
Contributor

axdotl commented Sep 17, 2020

@monotek Could you please take care of it?

@monotek
Contributor

monotek commented Sep 17, 2020

Please try with current version 0.1.193.

@auroaj

auroaj commented Sep 17, 2020

> Please try with current version 0.1.193.

The same issue.

@axdotl
Contributor

axdotl commented Sep 18, 2020

As a stopgap I'm using SLEEP mode, which actively polls the kube-api.

But what's causing the problem in WATCH mode is currently unclear to me. Maybe the k8s client used in the sidecar is having issues...

I created #90, maybe this helps 🤞🏼

@monotek
Contributor

monotek commented Sep 21, 2020

@auroaj & @PeterGerrard
#90 was merged.
Please try with 0.1.209.

@monotek
Contributor

monotek commented Sep 22, 2020

@axdotl
Since you mentioned the sleep command, do you think this could help too: https://github.com/kiwigrid/k8s-sidecar/pull/88/files ?

@qaiserali

Any updates on this? I'm getting the same issue and the sidecar stops working after a couple of minutes.

@monotek
Contributor

monotek commented Sep 29, 2020

No; as you can see, there was no feedback on whether it works with image 0.1.209.

@qaiserali

> No; as you can see, there was no feedback on whether it works with image 0.1.209.

I have tried it with image 0.1.209, and it doesn't work.

@monotek
Contributor

monotek commented Sep 30, 2020

Does anyone of you know what the last working version is?

@axdotl
Contributor

axdotl commented Sep 30, 2020

I think it is a general issue. I ran for a long time with 0.1.20, where the issue also occurred... maybe it is more related to changes in the k8s API.

@monotek
Contributor

monotek commented Sep 30, 2020

So the last working k8s version would be interesting too.

@pulledtim
Contributor

The title says it stops working after 10 min. I adjusted the test to add a new ConfigMap after 11 min, and it works with all k8s versions. Can anyone say after what time they observed the problem?

@axdotl
Copy link
Contributor

axdotl commented Oct 14, 2020

My assumption is that this is related to interruptions of the connection to the kube-api. These might cause the resource watching to stop.

@auroaj

auroaj commented Oct 16, 2020

> The title says it stops working after 10 min. I adjusted the test to add a new ConfigMap after 11 min, and it works with all k8s versions. Can anyone say after what time they observed the problem?

3-4 hours for me.

@auroaj

auroaj commented Oct 20, 2020

Checked with kiwigrid/k8s-sidecar:1.1.0 and AKS K8s Rev: v1.17.11.
The same issue.

@PoliM
Contributor

PoliM commented Oct 27, 2020

There is an interesting fix in the Kubernetes Python Client v12.0.0. From the changelog, https://github.com/kubernetes-client/python/blob/release-12.0/CHANGELOG.md:

> Retry expired watches kubernetes-client/python-base#133

@PoliM
Contributor

PoliM commented Oct 28, 2020

Thanks for merging. I updated the deployment yesterday to the docker image tag 0.1.259 and this morning, 15 hours later, it still detects modifications on configmaps 👍
There is also a change in the log. About every 30 to 60 minutes there's an entry like:

[2020-10-28 07:29:07] ApiException when calling kubernetes: (410)
Reason: Expired: too old resource version: 194195784 (199007602)

And so the resource watcher gets restarted.
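In code, that restart behaviour looks roughly like this; a minimal sketch, not the sidecar's actual code, assuming a `default` namespace and the `grafana_dashboard` label from the repro above:

```python
from kubernetes import client, config, watch
from kubernetes.client.rest import ApiException

config.load_incluster_config()
v1 = client.CoreV1Api()

while True:
    try:
        for event in watch.Watch().stream(v1.list_namespaced_config_map,
                                          namespace="default",
                                          label_selector="grafana_dashboard"):
            print(event["type"], event["object"].metadata.name)
    except ApiException as e:
        if e.status != 410:
            raise
        # "Expired: too old resource version" -> start a fresh watch
        print("ApiException when calling kubernetes: (410), restarting watch")
```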

BTW, the tag 1.2.0 had a build error, that's why I used 0.1.259 from the CI build.

@auroaj

auroaj commented Oct 28, 2020

Hmmm...
Not in my case...
Still the same issue.
Also, I can't find this entry:

ApiException when calling kubernetes: (410)

I've checked twice with grafana-sc-dashboard kiwigrid/k8s-sidecar:0.1.259; it stops working.

K8s Rev: v1.17.11

I'll try to check it in another location.

@qaiserali

I'm also still getting the issue with version 'k8s-sidecar:0.1.259'. It stops working after a couple of minutes.

@auroaj

auroaj commented Oct 28, 2020

Checked in another location. Unfortunately, it was updated to v1.17.11 too.
I've got the same result: just several minutes (instead of hours with the previous version) before it stops.
And I've found this line in the logs of one of my envs (the only one):

[2020-10-28 14:03:21] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@djsly

djsly commented Nov 10, 2020

Are we sure that 0.1.259 matches the 1.2.0 release? The deploy stage failed: https://app.circleci.com/pipelines/github/kiwigrid/k8s-sidecar/47/workflows/f0000c91-ba71-42b7-828d-0f235915ab29/jobs/274

@OmegaVVeapon

OmegaVVeapon commented Nov 10, 2020

> Are we sure that 0.1.259 matches the 1.2.0 release? The deploy stage failed: https://app.circleci.com/pipelines/github/kiwigrid/k8s-sidecar/47/workflows/f0000c91-ba71-42b7-828d-0f235915ab29/jobs/274

It seems to be.

The only commit in the 1.2.0 release was the kubernetes library bump to 12.0.0, and if you check the libraries in that image, you'll see that the kubernetes library is updated.

```console
❯ docker run --rm -it kiwigrid/k8s-sidecar:0.1.259 sh
/app $ pip list | grep kubernetes
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
kubernetes        12.0.0
```

However, I still had the same issue as @qaiserali where it stopped working after a few minutes.

@djsly

djsly commented Nov 25, 2020

Any luck from anyone with a more recent version? We moved to the SLEEP method, but now the dashboards are not getting removed on ConfigMap deletion :(

@chadlwilson

chadlwilson commented Nov 25, 2020

@djsly We noticed exactly the same thing about dashboards not getting deleted.

With a specific use of this sidecar (Grafana) earlier today, this behaviour under SLEEP mode caused a denial of service on one of our Grafana environments when a ConfigMap was effectively renamed (same content, one ConfigMap deleted, another one added). Because the dashboard with the old name was not deleted, the new one was then detected as not having a unique uid, which led to grafana/grafana#14674. Apart from hundreds of thousands of logged Grafana errors, this seemingly caused the dashboard version to churn in the Grafana DB between two different versions, bloating our DB and eventually running out of space for the Grafana DB on our PVs. Cool. :-)

So we went back to WATCH mode; and upgraded the sidecar to 1.3.0 which includes the kube client bump.

In our case WATCH mode only occasionally lost connectivity / stopped watching, as far as we noticed, rather than every 10 minutes or every few hours as some people have observed here. Since we run on AWS EKS, one theory was that the watches got terminated during control plane/master upgrades by AWS and were not re-established reliably, but that was just a theory given how infrequently we had experienced the issues with WATCH. We'll see how we go.

@vsabella

I think there was a similar issue with the notifications API that affected aadpodidentity as well.
The fix was to update to the latest Kubernetes libraries, which had a fix for this, but I'm not sure whether that applies to the Python libraries as well. Either way, it doesn't work behind a load balancer at the moment.

@stale

stale bot commented Sep 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@visand

visand commented Oct 6, 2021

I'm also seeing this issue. Running quay.io/kiwigrid/k8s-sidecar:1.12.3 on AKS. Did you all work around this issue somehow or should the issue be reopened?

@diversit

diversit commented Nov 5, 2021

Still experiencing this issue on AKS with k8s-sidecar:1.14.2.

Is there still the problem that when using LIST deleted resources are not deleted?

@jekkel
Member

jekkel commented Nov 5, 2021

> Is there still the problem that when using LIST deleted resources are not deleted?

Yes. We're not keeping track of resources: with WATCH the sidecar follows the events k8s emits (create/update/delete), whereas SLEEP more or less acts like create/update all the time.
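Roughly, the difference looks like this; a sketch only, not the actual sidecar code, where `write_file`/`remove_file` are hypothetical stand-ins for the sidecar's file handling:

```python
import time
from kubernetes import client, config, watch

config.load_incluster_config()
v1 = client.CoreV1Api()
LABEL = "grafana_dashboard"

def write_file(cm):
    # hypothetical helper: persist the ConfigMap's data under FOLDER
    print("write", cm.metadata.name)

def remove_file(cm):
    # hypothetical helper: delete the previously written file
    print("remove", cm.metadata.name)

def watch_mode():
    # WATCH: react to the event type, so DELETED events can remove files
    for event in watch.Watch().stream(v1.list_config_map_for_all_namespaces,
                                      label_selector=LABEL):
        if event["type"] in ("ADDED", "MODIFIED"):
            write_file(event["object"])
        elif event["type"] == "DELETED":
            remove_file(event["object"])

def sleep_mode(interval=60):
    # SLEEP: periodically list and (re)write everything; deletions are never
    # observed, so stale files stay behind
    while True:
        for cm in v1.list_config_map_for_all_namespaces(label_selector=LABEL).items:
            write_file(cm)
        time.sleep(interval)
```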

@gowrisankar22

The issue is reproducible with the 1.14.2 sidecar as well.

[2021-11-10 03:47:48] ProtocolError when calling kubernetes: ("Connection broken: ConnectionResetError(104, 'Connection reset by peer')", ConnectionResetError(104, 'Connection reset by peer'))

@jekkel
Member

jekkel commented Nov 10, 2021

Thanks for the report. Unfortunately I have no idea what's going wrong here. Any ideas? The current state is that we pass a urllib Retry configured by REQ_RETRY_TOTAL, REQ_RETRY_CONNECT, REQ_RETRY_READ, REQ_RETRY_BACKOFF_FACTOR and REQ_TIMEOUT into the k8s client, so in general k8s communication should be retried. But from your report it seems that watching, i.e. streaming updates, is not really subject to those retries.

So I'd be happy to follow any pointers you might have.
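For reference, the retry wiring described above looks roughly like this; a sketch with illustrative defaults (the sidecar's real defaults may differ), not the actual code, and REQ_TIMEOUT, which is applied per request, is omitted:

```python
import os
import urllib3
from kubernetes import client, config

# Illustrative defaults only; the sidecar's real defaults may differ.
retries = urllib3.Retry(
    total=int(os.environ.get("REQ_RETRY_TOTAL", 5)),
    connect=int(os.environ.get("REQ_RETRY_CONNECT", 10)),
    read=int(os.environ.get("REQ_RETRY_READ", 5)),
    backoff_factor=float(os.environ.get("REQ_RETRY_BACKOFF_FACTOR", 1.1)),
)

config.load_incluster_config()
configuration = client.Configuration.get_default_copy()
configuration.retries = retries  # handed through to urllib3's PoolManager
v1 = client.CoreV1Api(client.ApiClient(configuration))

# Plain requests (e.g. list calls) go through urllib3 and benefit from the
# Retry above; a watch is one long-lived streamed response, so a connection
# reset mid-stream is not retried transparently.
```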

@diversit

Unfortunately I have no idea either.
It seems to be retrying, but each retry always fails with the same ProtocolError (see above).

@bergerx

bergerx commented Dec 17, 2021

Is anyone here familiar with how the network timeouts are configured when using watches/informers with client-go?
I previously missed this comment: #85 (comment)
I strongly believe this has something to do with the network level.

@bergerx

bergerx commented Dec 17, 2021

Ah, I always thought this repo was written in Go. Now I checked the code for the first time with @vsliouniaev's kubernetes-client/python#1148 (comment) comment in mind, and I see that we can test this out pretty easily.

Apparently, the details are now covered in https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md.
The server-side timeout has a default in the Python library (a random value between 1800 and 3600 seconds), but the client-side timeout seems to default to None.

We can give the below change a try here.

Update this part:

```python
additional_args = {
    'label_selector': label_selector
}
if namespace != "ALL":
    additional_args['namespace'] = namespace
stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)
```

As this:

```python
additional_args = {
    'label_selector': label_selector,

    # Tune default timeouts as outlined in
    # https://github.com/kubernetes-client/python/issues/1148#issuecomment-626184613
    # https://github.com/kubernetes-client/python/blob/master/examples/watch/timeout-settings.md
    # I picked 60 and 66 due to https://github.com/nolar/kopf/issues/847#issuecomment-971651446

    # 60 is a polite request to the server, asking it to cleanly close the connection after that.
    # If you have a network outage, this does nothing.
    # You can set this number much higher, maybe to 3600 seconds (1h).
    'timeout_seconds': int(os.environ.get("WATCH_SERVER_TIMEOUT", 60)),

    # 66 is a client-side timeout, configuring your local socket.
    # If you have a network outage dropping all packets with no RST/FIN,
    # this is how long your client waits before realizing & dropping the connection.
    # You can keep this number low, maybe 60 seconds.
    '_request_timeout': int(os.environ.get("WATCH_CLIENT_TIMEOUT", 66)),
}
...
stream = watch.Watch().stream(getattr(v1, _list_namespace[namespace][resource]), **additional_args)
```

This is also effectively what the alternative kopf-based implementation does; see also nolar/kopf#585 for the historical context on these settings:
https://github.com/OmegaVVeapon/kopf-k8s-sidecar/blob/main/app/sidecar_settings.py#L58-L70

And, ironically, the kopf project (which OmegaVVeapon/kopf-k8s-sidecar is based on) currently has nolar/kopf#847 open, which seems to be related, but I guess that's another edge case. We have been using kopf in our AKS clusters without regular issues, but this particular kiwigrid/k8s-sidecar issue is quite frequent.

I'll try to give my suggestion above a chance if I can reserve some time, but given that we already have a workaround in place (grafana/helm-charts#18 (comment)), it likely won't be soon.

jekkel added a commit that referenced this issue Dec 20, 2021
Add proposed timeout configuration parameters.
@jekkel jekkel reopened this Dec 20, 2021
@stale stale bot removed the "wontfix" label Dec 20, 2021
@jekkel jekkel linked a pull request Dec 20, 2021 that will close this issue
@jekkel
Member

jekkel commented Dec 20, 2021

@bergerx Thanks a lot for this detailed analysis. I took the opportunity and incorporated your proposal in a PR. Hopefully we can get this issue fixed with it. 🤞
