Support request for 404 errors whilst scaling the gateway #749
Comments
FYI, you are asking for support on a private, forked version of openfaas: https://github.com/cognitedata/faas-netes/blob/dcc3e502db13190d282529809c6d243b82ed1637/pkg/k8s/proxy.go#L78 - I'm not sure how we can be of help there. We did offer you support and consulting, which is available to any user via the support page. There's no obligation for you to take us up on this offer, of course - that's the joy of open source.
The links are updated to the non-forked version now; the fork should not have any impact on this. There is of course no expectation that you'll help us, I was just asking.
Hi @matzhaugen, thanks for updating the links. It would be helpful to have the logs if you can share them. I spent 3 hours trying to debug this for you on my environment and couldn't reproduce the error - I was generating load with hey at a steady rate, with one function, then scaling the number of gateways. I only got 200s, no 404s. It would help to know more contextual information:
Are you hoping that I will spend more time investigating this, or are you willing to put in the work to give a reproducible scenario? Without a series of steps to reproduce this, and without any support arrangement, it's challenging to spend much more time on an investigation. I hope you understand? If you can provide a repro with step-by-step instructions, then it would make it easier for me and others to spend time on it. What workarounds have you considered? Perhaps you can simply retry if you get a 404 for a function that you know exists in your series of definitions? Perhaps you can have a static number of gateway pods?
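For illustration, the retry workaround suggested above could look something like the following Go sketch. The gateway URL, function name, attempt count and back-off values are placeholders, and this is not part of OpenFaaS itself - just a client-side wrapper one could write.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// invokeWithRetry posts a payload to the gateway and retries briefly on a 404,
// on the assumption that the missing endpoint is a transient cache miss for a
// function that is known to exist.
func invokeWithRetry(gatewayURL, fn string, payload []byte, attempts int) (*http.Response, error) {
	for i := 0; i < attempts; i++ {
		resp, err := http.Post(gatewayURL+"/function/"+fn, "application/octet-stream", bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusNotFound {
			// Success or a non-404 error: hand the response back to the caller.
			return resp, nil
		}
		resp.Body.Close()
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // simple linear back-off
	}
	return nil, fmt.Errorf("still getting 404 after %d attempts", attempts)
}

func main() {
	// "http://127.0.0.1:8080" and "env" are placeholders for a real gateway and function.
	resp, err := invokeWithRetry("http://127.0.0.1:8080", "env", nil, 3)
	if err != nil {
		fmt.Println("invoke failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```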
Not quite, this is the actual flow:
Alex
@alexellis, thanks for having invested so much time into this. I'll see if I can come up with a reproducible local scenario - no need to spend more time on it. As far as workarounds are concerned, we can do a retry, although sometimes the gateway will only give 404s and no 200s after some point in time, usually right after it starts. Sometimes it recovers, as seen in the graph above; other times it never recovers and we have to kill it manually. In the latter case, retries will not be feasible. We could also patch a health probe with

To answer your questions:
GKE. We cannot reproduce this locally yet.
What do you mean by version of function? We are using faas-netes 0.12.4. We are also using the idler with scale to zero. We haven't bumped in a while. In fact, the manifest was happily running without problems for 3 months, which is why it's so surprising that random gateways would start to fail. As I'm looking at the updated openfaas code, it seems like the idler is not available in the community edition anymore so bumping in the short term might be out of the question.
We scale dynamically based on the number of calls, from 0 to 12 pods. This is part of why it might be hard to reproduce locally. One would suspect that this might be the key to the issue, but if only 1 out of 10 gateways falls over, I find it hard to believe that the problem has to do with that.
From 3 to 30.
Python 3.7
We have no ingress on the gateway, as communication is piped through a service on top of the gateway. The load balancing is in fact working fine, with even distribution across the gateways.
The watchdog and the template - I had a typo in my original response. Are you using forked or standard versions of both?
We are using the classic watchdog with one patch, which should be unrelated to this. The template is our own template, which again should be unrelated. A more standard 404 pattern is shown below, with an even distribution across gateways.
Thanks for sharing your Grafana dashboard. Let us know when you have a minimal reproduction that can be shared, or update this issue.
We also have a 404 issue when adding the 4th gateway.
@francisdb feel free to provide instructions to reproduce the issue you're facing, and it may make sense for you to detail exactly how you're using openfaas, as I imagine it's completely different from Cognite, who have their own custom fork of several OpenFaaS components. Also, can you say who you are representing when you say "we"?
Just wanted to post a heads-up; if this is not helpful you can just ignore my message until we come up with more info.
It's good practice to introduce yourself when you want help and to participate within a community. Perhaps you could say what your use-case is, to help us understand who we're talking to and how we can best help you? A good overview exists here on opensource.com. There was a good example of this by Cognite last year -> #599 (comment)
Let me ask @Sam-Persoon at waylay.io to come up with more details. He's got better manners than me 😬
Ok, that helps - we know Waylay and Giles.
Hey @alexellis, apologies for some of the communication in this thread - we were still compiling some data and didn't mean to bump the issue before we had something meaningful to contribute. We at Waylay are experiencing similar behavior. We are slowly ramping up our load to test things out, and when adding a 4th replica of the gateway we noticed the exact same behavior as described above. To provide some context:
GCP; not really reproducible locally, as we only need 4 replicas under a very heavy load (25,000+ executions/min)
The faas-idler is enabled
1 or 2
About 800 total, but only around 20 are called. All others are scaled to 0 and are never scaled up.
Node.js
We use the GKE ingress controller, but the OpenFaaS gateway uses a

With 3 gateways we can handle 20,000 executions/min smoothly (impressive!). We were running a load test and ramping the load up while scaling the gateway up. When a 4th replica was needed, a lot of errors started coming in. If we scale up further, the 5th replica will again behave nicely, but the 4th replica keeps having issues. Almost all requests to it (which are around 5k/min) return

On our Grafana dashboard we can see (and I've annotated) when we added, removed and then added again a fourth replica (the area chart is stacked :))

Let me know if you need any more information or want us to perform additional tests, we'd be happy to help!
Thank you for the extra information. This sounds like a very different issue. Cognite are pushing very low volumes of data and use a custom fork; you are trying to find the breaking point of openfaas on GKE. This looks like something we would need to debug with you on a high-touch engagement - let's follow up on email? I would suggest a separate issue, but I feel this is very specific to each company rather than a generic situation. If we can find an issue by working with either company, then a generic issue would make sense, along with whatever else we need to get a resolution. In the interim, if you have anything else we could use to look into the problem, feel free to share it here.
@gillesdemey would you mind sharing a copy of your dashboard JSON via a gist and a link to the issue? We also had a related request from the community to see the full dashboard, to see how well Grafana renders 800 deployments (that part is optional to provide here). We have some ideas on how to investigate this and a number of ideas for overcoming the problem, if you want to follow up with me.
One more question for both (although I think that I know the answer for Cognite) - operator or faas-netes, which are you using in your chart? i.e. are you both using

Do you have any logs from client-go about rate-limiting?

What's the minimum case I need, @gillesdemey, to reproduce this? What Prometheus query or metric name? What does it show - is it 404, or some other error code? If it's 404, then it may be related to this issue; if it's 502, then it's likely unrelated to what @matzhaugen reported, and due to the loads you are creating. There are tuning instructions for load testing.

So far I've spent around six hours debugging this and am yet to reproduce it on a bare-metal test cluster with k3s.
The HTTP server which is used for CRUD and invocations should not be started until the cached informer for endpoints is ready. This may be related to issue #749. The cache sync duration is now being logged and measured, so that users can provide logs and additional information. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <[email protected]>
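As a rough illustration of the ordering this commit message describes, a gateway-style process might gate its HTTP listener on the endpoints informer sync roughly as below. This is a sketch built on client-go, not the actual faas-netes code; the resync period and listen address are arbitrary.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the process runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	start := time.Now()
	if ok := cache.WaitForCacheSync(stopCh, endpointsInformer.HasSynced); !ok {
		log.Fatal("endpoints cache never synced before shutdown")
	}
	log.Printf("endpoints cache synced in %s", time.Since(start)) // the duration now being logged

	// Only now start serving CRUD and invocation traffic.
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```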
Hi @matzhaugen @gillesdemey, I still haven't reproduced this issue, but this new release may resolve part or all of it by preventing the gateway from serving traffic until it has synchronised the initial list of function endpoints. https://github.com/openfaas/faas-netes/releases/tag/0.12.16
/set title: Support request for 404 errors whilst scaling the gateway
/add label: support,question
@matzhaugen and @gillesdemey - what results have you seen with the new release of faas-netes? This change was made to support your use of OpenFaaS, so when you have a moment it'd be good to know whether it was helpful or whether we need to do more on this.
Hey @alexellis, thanks for investigating, much appreciated! We've re-run our load testing experiment and the behavior no longer occurs with version

We've gone as far as running 4 to 5 replicas sustaining about 40k reqs/min and ran that for about half an hour. We can handle the same amount of traffic with fewer replicas, but in this experiment we weren't trying to find a breaking point.

Sadly we cannot share our Grafana dashboard with you, since the metrics are generated from a bespoke internal proxy that communicates with the openfaas gateway.
We need the specific metric that you are tracking - is it one we expose in the openfaas gateway? Glad we could be of help to you here. There are further changes that could scale to much higher numbers, but they require a bigger investment of R&D.
Sadly no, it's a metric that is emitted by our proxy component and includes some tags that are specific to our platform.
After deploying the new fix from the faas-netes master, we now see that some of the gateway deployments hang and get stuck in a CrashLoopBackOff, only to be fixed by deleting and restarting the affected pod. This is likely due to the cache sync hanging in the following code block, never reaching the log statement that signals that the cache sync is complete (see image below).

As for possible reasons: perhaps large changes in the deployments are causing the cache to get corrupted somehow, or it sometimes takes more than 30 seconds to sync the cache. Before the fix above went in, we sometimes saw that it took ~30 min for the gateway to fix itself (see figure below).

There are similar situations posted in the kubernetes repo, but faas-netes doesn't seem to use a shared informer, so I'm not sure if this is relevant.
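One way to make a hang like that visible, instead of blocking indefinitely, is to bound the wait for the sync. The helper below is only a sketch under that assumption; the function name and the deadline are illustrative, and this is not what faas-netes does today.

```go
package gateway

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncWithTimeout waits for the informer's HasSynced to report true,
// and returns an error if that does not happen within the deadline, so the
// caller can log the failure and exit instead of hanging silently.
func waitForSyncWithTimeout(hasSynced cache.InformerSynced, timeout time.Duration) error {
	stopCh := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stopCh) }) // abort the wait at the deadline
	defer timer.Stop()

	if ok := cache.WaitForCacheSync(stopCh, hasSynced); !ok {
		return fmt.Errorf("endpoints cache did not sync within %s", timeout)
	}
	return nil
}
```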
Thanks for trying the patch. I wonder how we can reproduce this to help you with testing - do you have any suggestions? Have you thought of having a static number of gateway replicas? And of increasing the initial HTTP healthcheck delay to a higher number - 30s maybe? There are several more ideas we have, but they need an investment of a week or two of R&D, plus a reproducible environment/setup.
Do you mean it won't start until marked as healthy? It's only healthy after passing the HTTP health check, which it won't do until the cache has synchronised.
I just mean that because the cache sync is hanging, it doesn't pass the health check before the pod restart trigger kicks in. It's a good idea to extend the initial health check.
I tried increasing the initial health check to 30 seconds and unfortunately it didn't help; there are still newly started gateways that are stuck in a CrashLoopBackOff, and a restart is not guaranteed to fix the issue. Prior to the patch the gateways would sync their cache eventually, although it might take time; after the patch the gateways won't even start, even with a 30s healthcheck delay. The former is the more desirable state for us, as we can always retry the 404s. After reverting the patch on our fork, it seems clear that the endpointsInformer cache sync is the culprit.

I suspect that this is an artifact of having a very large cluster with a lot of moving pieces, which causes the cache to have trouble syncing and also makes it difficult to reproduce locally. Perhaps if we issue a high rate of scale requests for the functions and at the same time spin up a new gateway while sending other requests, we could get an approximate emulation of the production environment. To put things in context a bit, we have ~600 functions deployed in one cluster, all in one namespace, although we see these long cache sync times in other clusters with only 100 functions.

In the rare event that the gateway is only returning 404s, one would have to restart it manually, but that has not happened in a long time. We have a retry in the queue-worker, but I wonder if maybe a retry in proxy.go or in the provider
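A lookup retry of the kind floated at the end of that comment might look roughly like the sketch below: poll the endpoints lister a few times before treating the function as absent. The helper name and the 3 x 100 ms budget are invented for the example; this is not the faas-netes provider code.

```go
package gateway

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getEndpointsWithRetry retries a cache miss a few times before letting the
// caller map the result to a 404.
func getEndpointsWithRetry(lister corelisters.EndpointsLister, namespace, name string) (*corev1.Endpoints, error) {
	var lastErr error
	for i := 0; i < 3; i++ {
		ep, err := lister.Endpoints(namespace).Get(name)
		if err == nil {
			return ep, nil
		}
		if !errors.IsNotFound(err) {
			return nil, err // a real error, not just a missing cache entry
		}
		lastErr = err
		time.Sleep(100 * time.Millisecond)
	}
	return nil, lastErr // still not found after retrying
}
```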
My current challenges are:
How many replicas of the functions do you have in each case? Is the underlying issue coupled to the number of gateway replicas that you are scheduling, and Kubernetes rate-limiting? We should see errors from client-go if you enable the right log level verbosity. Is the cache sync stuck due to an exponential back-off of API requests, or is it stuck due to a race condition?
Feel free to kick the tires with this release: https://github.com/openfaas/faas-netes/releases/tag/0.12.18 It should resolve the issue with the gateway getting stuck in the wait for the cache sync.
Hey Alex, thanks for the new release. I've tried the new patch on my fork and made a branch that recreates a scenario producing a number of 404s with a load testing script. Let me know if something is unclear, or if the API is not used as intended.
How does this compare to where we were 23 days ago? According to the code in the controller, this should now wait for a sync before going any further.
I have not been able to reproduce any other 404 errors except
Yes, but this is after the operator has started, only a result of scaling up and down.
Actually, reducing the Cache Expiry time to 100ms in the gateway seemed to get rid of the 404s in a test environment. I'm not advocating this as a solution yet, just an observation.
This is something I was looking into last week with @aledbf, and I didn't remember that there was a 5s cache lifetime for function availability. Another way to look at this (which is separate from the original issue raised) is to ensure that the function's graceful shutdown time (to drain requests) is higher than the gateway's cache period.
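To illustrate that interaction, here is a toy availability cache with a fixed lifetime: an entry cached just before a pod starts draining keeps answering until it expires, which is why the drain window should be longer than the cache lifetime. The names and the TTL handling are illustrative, not the gateway's actual implementation.

```go
package gateway

import (
	"sync"
	"time"
)

type entry struct {
	available bool
	cachedAt  time.Time
}

// AvailabilityCache remembers per-function availability for a fixed lifetime.
type AvailabilityCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func NewAvailabilityCache(ttl time.Duration) *AvailabilityCache {
	return &AvailabilityCache{ttl: ttl, entries: map[string]entry{}}
}

// Get returns the cached value and whether it is still fresh. A stale or
// missing entry forces the caller to re-check the real source of truth.
func (c *AvailabilityCache) Get(fn string) (available bool, fresh bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[fn]
	if !ok || time.Since(e.cachedAt) > c.ttl {
		return false, false
	}
	return e.available, true
}

// Set records the availability observed just now.
func (c *AvailabilityCache) Set(fn string, available bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[fn] = entry{available: available, cachedAt: time.Now()}
}
```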
I'm closing this issue given that I hear from @matzhaugen (the issue creator) that he's moved on from where he was using openfaas. If the new team land here, feel free to open a new issue, or talk to us about support https://openfaas.com/support/
/lock |
Expected Behaviour
When a new gateway pod is created in an HA environment, we expect the endpoint lister to be able to find all the function deployments without problems.
Current Behaviour
In an HA gateway deployment with 4 gateways, we sometimes see 1 of the 4 gateways consistently return 404s (Not Found) for some fraction of the calls in the faas-netes container. This is most likely to happen right after the new gateway pod is created. We see about 99% of our errors coming from this line and 1% coming from this line.
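For context, the resolution path being described works roughly like the sketch below: the proxy consults its local endpoints cache and answers 404 on a miss, whether or not the Deployment actually exists in the cluster. This is a simplified illustration, not the actual proxy.go; the query-parameter parsing in particular is a placeholder.

```go
package gateway

import (
	"fmt"
	"net/http"

	"k8s.io/apimachinery/pkg/api/errors"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// makeProxyHandler resolves the target function through the cached endpoints
// lister before forwarding the request.
func makeProxyHandler(lister corelisters.EndpointsLister, namespace string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		functionName := r.URL.Query().Get("function") // placeholder for real path parsing

		ep, err := lister.Endpoints(namespace).Get(functionName)
		if errors.IsNotFound(err) {
			// A cache miss surfaces here as a 404, which is consistent with
			// the behaviour described in this issue.
			http.Error(w, fmt.Sprintf("no endpoints found for %s", functionName), http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		if len(ep.Subsets) == 0 {
			http.Error(w, "no ready addresses for "+functionName, http.StatusServiceUnavailable)
			return
		}

		// ...pick an address from ep.Subsets and reverse-proxy the request...
		w.WriteHeader(http.StatusOK)
	}
}
```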
We run with a setup where function pods are terminated within 10 seconds of a call completing, unless other connections from a gateway are opened.
A typical scenario looks as follows: a new gateway is created and we see that some fraction of calls return 404s.
In the above picture, new gateways were spawned right after 9:40am, when the 404s started to show up.
We have also set
gateway.directFunctions=false
and run all calls asynchronously, through the NATS queue and the queue-workers. So if my interpretation of the call path is correct it should be: Gateway -> faas-netes -> nats-queue -> queue-worker -> Function pod (correct?) We are seeing the 404 in the faas-netes container, so I suspect the NATS queue is innocent.
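For reference, the asynchronous invocation described above is, from a client's point of view, a POST to the gateway's /async-function/<name> route, which is accepted and later executed via the queue. In the sketch below the gateway URL and function name are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	body := bytes.NewBufferString(`{"example": true}`)

	// An async invocation is queued rather than executed inline.
	resp, err := http.Post("http://127.0.0.1:8080/async-function/env", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// A healthy submission returns 202 Accepted; the work itself is picked up
	// from the queue by the queue-worker afterwards.
	fmt.Println("status:", resp.Status)
}
```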
Possible Solution
We're not sure how to solve this. I suspect something is corrupting the cache of the k8s client, perhaps because function pods are going in and out of existence so fast. Then when we ask for the entry in the cache here, it won't be successful. But then I would expect all gateways to be equally affected, not just one.
Steps to Reproduce (for bugs)
Unfortunately, this bug is difficult to reproduce locally due to the complex call pattern that we have. One might be able to reproduce it by creating a scenario where pods are being called at intervals of less than 1s.
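A rough sketch of that reproduction idea: call a single function at sub-second intervals while scaling the gateway deployment up and down (for example with kubectl), and count how many calls come back as 404. The gateway URL, function name, rate and duration below are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var total, notFound int64

	ticker := time.NewTicker(200 * time.Millisecond) // roughly 5 calls per second
	defer ticker.Stop()
	deadline := time.After(5 * time.Minute)

	for {
		select {
		case <-deadline:
			fmt.Printf("done: %d calls, %d returned 404\n", atomic.LoadInt64(&total), atomic.LoadInt64(&notFound))
			return
		case <-ticker.C:
			go func() {
				resp, err := http.Get("http://127.0.0.1:8080/function/env")
				if err != nil {
					return // network errors are ignored in this sketch
				}
				defer resp.Body.Close()
				atomic.AddInt64(&total, 1)
				if resp.StatusCode == http.StatusNotFound {
					atomic.AddInt64(&notFound, 1)
				}
			}()
		}
	}
}
```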
Context
This issue is vital for us to solve due to the unexpected nature of the 404s.
Your Environment
FaaS-CLI version (full output from faas-cli version): NA
Docker version: containerd://1.3.2g
Kubernetes version: v1.17.14-gke.400
Operating System and version (e.g. Linux, Windows, MacOS): GKE, Linux Container-Optimized OS from Google
Link to your project or a code example to reproduce issue: NA