Support request for 404 errors whilst scaling the gateway #749
Comments
FYI, you are asking for support on a private, forked version of openfaas: https://github.com/cognitedata/faas-netes/blob/dcc3e502db13190d282529809c6d243b82ed1637/pkg/k8s/proxy.go#L78 - I'm not sure how we can be of help there. We did offer you support and consulting, which is available to any user via the support page. There's no obligation for you to take us up on this offer, of course - that's the joy of open source.
The links are updated to the non-forked version now; the fork should not have any impact on this. There is of course no expectation that you'll help us, I was just asking.
Hi @matzhaugen, thanks for updating the links. It would be helpful to have the logs if you can share them. I spent 3 hours trying to debug this for you on my environment and couldn't reproduce the error - I was generating load with hey at a steady rate, with one function, then scaling the number of gateways. I only got 200s, no 404s. It would help to know more contextual information:
Are you hoping that I will spend more time investigating this, or are you willing to put in the work to give a reproducible scenario? Without a series of steps to reproduce this, and without any support arrangement, it's challenging to spend much more time on an investigation. I hope you understand? If you can provide a repro with step-by-step instructions, then it would make it easier for me and others to spend time on it. What workarounds have you considered? Perhaps you can simply retry if you get a 404 for a function that you know exists in your series of definitions? Perhaps you can have a static number of gateway pods?
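For illustration, the retry workaround suggested above could look something like the following Go sketch. The gateway URL, function name, attempt count and back-off values are placeholders, and this is not part of OpenFaaS itself - just a client-side wrapper one could write.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// invokeWithRetry posts a payload to the gateway and retries briefly on a 404,
// on the assumption that the missing endpoint is a transient cache miss for a
// function that is known to exist.
func invokeWithRetry(gatewayURL, fn string, payload []byte, attempts int) (*http.Response, error) {
	for i := 0; i < attempts; i++ {
		resp, err := http.Post(gatewayURL+"/function/"+fn, "application/octet-stream", bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusNotFound {
			// Success or a non-404 error: hand the response back to the caller.
			return resp, nil
		}
		resp.Body.Close()
		time.Sleep(time.Duration(i+1) * 500 * time.Millisecond) // simple linear back-off
	}
	return nil, fmt.Errorf("still getting 404 after %d attempts", attempts)
}

func main() {
	// "http://127.0.0.1:8080" and "env" are placeholders for a real gateway and function.
	resp, err := invokeWithRetry("http://127.0.0.1:8080", "env", nil, 3)
	if err != nil {
		fmt.Println("invoke failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```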
Not quite, this is the actual flow:
Alex
@alexellis, thanks for having invested so much time into this. I'll see if I can come up with a reproducible local scenario - no need to spend more time on it. As far as workarounds are concerned, we can do a retry, although sometimes the gateway will only give 404s and no 200s after some point in time, usually right after it starts. Sometimes it recovers, as seen in the graph above; other times it never recovers and we have to kill it manually. In the latter case, retries will not be feasible. We could also patch a health probe with

To answer your questions:
GKE. We cannot reproduce this locally yet.
What do you mean by version of function? We are using faas-netes 0.12.4. We are also using the idler with scale to zero. We haven't bumped in a while. In fact, the manifest was happily running without problems for 3 months, which is why it's so surprising that random gateways would start to fail. As I'm looking at the updated openfaas code, it seems like the idler is not available in the community edition anymore so bumping in the short term might be out of the question.
We scale dynamically based on the number of calls, from 0 to 12 pods. This is part of why it might be hard to reproduce locally. One would suspect that this might be the key to the issue, but if only 1 out of 10 gateways falls over, I find it hard to believe that the problem has to do with that.
From 3 to 30.
Python 3.7
We have no ingress on the gateway, as communication is piped through a service on top of the gateway. The load balancing is in fact working fine, with even distribution across the gateways.
The watchdog and the template - I had a typo in my original response. Are you using forked or standard versions of both?
We are using the classic watchdog with one patch, which should be unrelated to this. The template is our own template, which again should be unrelated. A more standard 404 pattern is shown below, with an even distribution across gateways.
Thanks for sharing your Grafana dashboard. Let us know when you have a minimal reproduction that can be shared, or update this issue.
We also have a 404 issue when adding the 4th gateway.
@francisdb feel free to provide instructions to reproduce the issue you're facing, and it may make sense for you to detail exactly how you're using openfaas, as I imagine it's completely different from Cognite, who have their own custom fork of several OpenFaaS components. Also, can you say who you are representing when you say "we"?
Just wanted to post a heads-up; if this is not helpful you can just ignore my message until we come up with more info.
It's good practice to introduce yourself when you want help and to participate within a community. Perhaps you could say what your use-case is, to help us understand who we're talking to and how we can best help you? A good overview exists here on opensource.com. There was a good example of this by Cognite last year -> #599 (comment)
Let me ask @Sam-Persoon at waylay.io to come up with more details. He's got better manners than me 😬
Ok, that helps - we know Waylay and Giles.
Hey @alexellis, apologies for some of the communication in this thread - we were still compiling some data and didn't mean to bump the issue before we had something meaningful to contribute. We at Waylay are experiencing similar behavior. We are slowly ramping up our load to test things out, and when adding a 4th replica of the gateway we noticed the exact same behavior as described above. To provide some context:
GCP; not really reproducible locally, as we only need 4 replicas under a very heavy load (25,000+ executions/min)
The faas-idler is enabled
1 or 2
About 800 total, but only around 20 are called. All others are scaled to 0 and are never scaled up.
Node.js
We use the GKE ingress controller, but the OpenFaaS gateway uses a

With 3 gateways we can handle 20,000 executions/min smoothly (impressive!). We were running a load test and ramping the load up while scaling the gateway up. When a 4th replica was needed, a lot of errors started coming in. If we scale up further, the 5th replica will again behave nicely, but the 4th replica keeps having issues. Almost all requests to it (which are around 5k/min) return

On our Grafana dashboard we can see (and I've annotated) when we added, removed and then added again a fourth replica (the area chart is stacked :))

Let me know if you need any more information or want us to perform additional tests, we'd be happy to help!
Thank you for the extra information. This sounds like a very different issue. Cognite are pushing very low volumes of data and use a custom fork; you are trying to find the breaking point of openfaas on GKE. This looks like something we would need to debug with you on a high-touch engagement - let's follow up on email? I would suggest a separate issue, but I feel this is very specific to each company rather than a generic situation. If we can find an issue by working with either company, then a generic issue would make sense, along with whatever else we need to get a resolution. In the interim, if you have anything else we could use to look into the problem, feel free to share it here.
@gillesdemey would you mind sharing a copy of your dashboard JSON via a gist and a link to the issue? We also had a related request from the community to see the full dashboard, to see how well Grafana renders 800 deployments (that part is optional to provide here). We have some ideas on how to investigate this and a number of ideas for overcoming the problem, if you want to follow up with me.
One more question for both (although I think that I know the answer for Cognite) - operator or faas-netes, which are you using in your chart? i.e. are you both using

Do you have any logs from client-go about rate-limiting?

What's the minimum case I need, @gillesdemey, to reproduce this? What Prometheus query or metric name? What does it show - is it 404, or some other error code? If it's 404, then it may be related to this issue; if it's 502, then it's likely unrelated to what @matzhaugen reported, and due to the loads you are creating. There are tuning instructions for load testing.

So far I've spent around six hours debugging this and am yet to reproduce it on a bare-metal test cluster with k3s.
The HTTP server which is used for CRUD and invocations should not be started until the cached informer for endpoints is ready. This may be related to issue #749. The cache sync duration is now being logged and measured, so that users can provide logs and additional information. Signed-off-by: Alex Ellis (OpenFaaS Ltd) <[email protected]>
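As a rough illustration of the ordering this commit message describes, a gateway-style process might gate its HTTP listener on the endpoints informer sync roughly as below. This is a sketch built on client-go, not the actual faas-netes code; the resync period and listen address are arbitrary.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig() // assumes the process runs inside the cluster
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	endpointsInformer := factory.Core().V1().Endpoints().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	start := time.Now()
	if ok := cache.WaitForCacheSync(stopCh, endpointsInformer.HasSynced); !ok {
		log.Fatal("endpoints cache never synced before shutdown")
	}
	log.Printf("endpoints cache synced in %s", time.Since(start)) // the duration now being logged

	// Only now start serving CRUD and invocation traffic.
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```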
Hi @matzhaugen @gillesdemey, I still haven't reproduced this issue, but this new release may resolve part or all of it by preventing the gateway from serving traffic until it has synchronised the initial list of function endpoints. https://github.com/openfaas/faas-netes/releases/tag/0.12.16
/set title: Support request for 404 errors whilst scaling the gateway
/add label: support,question
@matzhaugen and @gillesdemey - what results have you seen with the new release of faas-netes? This change was made to support your use of OpenFaaS, so when you have a moment it'd be good to know whether it was helpful or whether we need to do more on this.
Hey @alexellis, thanks for investigating, much appreciated! We've re-run our load testing experiment and the behavior no longer occurs with version

We've gone as far as running 4 to 5 replicas sustaining about 40k reqs/min and ran that for about half an hour. We can handle the same amount of traffic with fewer replicas, but in this experiment we weren't trying to find a breaking point.

Sadly we cannot share our Grafana dashboard with you, since the metrics are generated from a bespoke internal proxy that communicates with the openfaas gateway.
We need the specific metric that you are tracking - is it one we expose in the openfaas gateway? Glad we could be of help to you here. There are further changes that could scale to much higher numbers, but they require a bigger investment of R&D.
Sadly no, it's a metric that is emitted by our proxy component and includes some tags that are specific to our platform.
After deploying the new fix from the faas-netes master, we now see that some of the gateway deployments hang and get stuck in a CrashLoopBackOff, only to be fixed by deleting and restarting the affected pod. This is likely due to the cache sync hanging in the following code block, never reaching the log statement that signals that the cache sync is complete (see image below).

As for possible reasons: perhaps large changes in the deployments are causing the cache to get corrupted somehow, or it sometimes takes more than 30 seconds to sync the cache. Before the fix above went in, we sometimes saw that it took ~30 min for the gateway to fix itself (see figure below).

There are similar situations posted in the kubernetes repo, but faas-netes doesn't seem to use a shared informer, so I'm not sure if this is relevant.
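One way to make a hang like that visible, instead of blocking indefinitely, is to bound the wait for the sync. The helper below is only a sketch under that assumption; the function name and the deadline are illustrative, and this is not what faas-netes does today.

```go
package gateway

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncWithTimeout waits for the informer's HasSynced to report true,
// and returns an error if that does not happen within the deadline, so the
// caller can log the failure and exit instead of hanging silently.
func waitForSyncWithTimeout(hasSynced cache.InformerSynced, timeout time.Duration) error {
	stopCh := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stopCh) }) // abort the wait at the deadline
	defer timer.Stop()

	if ok := cache.WaitForCacheSync(stopCh, hasSynced); !ok {
		return fmt.Errorf("endpoints cache did not sync within %s", timeout)
	}
	return nil
}
```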
Thanks for trying the patch. I wonder how we can reproduce this to help you with testing - do you have any suggestions? Have you thought of having a static number of gateway replicas? And of increasing the initial HTTP healthcheck delay to a higher number - 30s maybe? There are several more ideas we have, but they need an investment of a week or two of R&D, plus a reproducible environment/setup.
Do you mean it won't start until marked as healthy? It's only healthy after passing the HTTP health check, which it won't do until the cache has synchronised.
I just mean that because the cache sync is hanging, it doesn't pass the health check before the pod restart trigger kicks in. It's a good idea to extend the initial health check.
I tried increasing the initial health check to 30 seconds and unfortunately it didn't help; there are still newly started gateways that are stuck in a CrashLoopBackOff, and a restart is not guaranteed to fix the issue. Prior to the patch the gateways would sync their cache eventually, although it might take time; after the patch the gateways won't even start, even with a 30s healthcheck delay. The former is the more desirable state for us, as we can always retry the 404s. After reverting the patch on our fork, it seems clear that the endpointsInformer cache sync is the culprit.

I suspect that this is an artifact of having a very large cluster with a lot of moving pieces, which causes the cache to have trouble syncing and also makes it difficult to reproduce locally. Perhaps if we issue a high rate of scale requests for the functions and at the same time spin up a new gateway while sending other requests, we could get an approximate emulation of the production environment. To put things in context a bit, we have ~600 functions deployed in one cluster, all in one namespace, although we see these long cache sync times in other clusters with only 100 functions.

In the rare event that the gateway is only returning 404s, one would have to restart it manually, but that has not happened in a long time. We have a retry in the queue-worker, but I wonder if maybe a retry in proxy.go or in the provider
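A lookup retry of the kind floated at the end of that comment might look roughly like the sketch below: poll the endpoints lister a few times before treating the function as absent. The helper name and the 3 x 100 ms budget are invented for the example; this is not the faas-netes provider code.

```go
package gateway

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// getEndpointsWithRetry retries a cache miss a few times before letting the
// caller map the result to a 404.
func getEndpointsWithRetry(lister corelisters.EndpointsLister, namespace, name string) (*corev1.Endpoints, error) {
	var lastErr error
	for i := 0; i < 3; i++ {
		ep, err := lister.Endpoints(namespace).Get(name)
		if err == nil {
			return ep, nil
		}
		if !errors.IsNotFound(err) {
			return nil, err // a real error, not just a missing cache entry
		}
		lastErr = err
		time.Sleep(100 * time.Millisecond)
	}
	return nil, lastErr // still not found after retrying
}
```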
My current challenges are:
How many replicas of the functions do you have in each case? Is the underlying issue coupled to the number of gateway replicas that you are scheduling, and Kubernetes rate-limiting? We should see errors from client-go if you enable the right log level verbosity. Is the cache sync stuck due to an exponential back-off of API requests, or is it stuck due to a race condition?
Feel free to kick the tires with this release: https://github.com/openfaas/faas-netes/releases/tag/0.12.18 It should resolve the issue with the gateway getting stuck in the wait for the cache sync.
Hey Alex, thanks for the new release. I've tried the new patch on my fork and made a branch that recreates a scenario producing a number of 404s with a load testing script. Let me know if something is unclear, or if the API is not used as intended.
How does this compare to where we were 23 days ago? According to the code in the controller, this should now wait for a sync before going any further.
I have not been able to reproduce any other 404 errors except
Yes, but this is after the operator has started, only a result of scaling up and down.
Actually, reducing the Cache Expiry time to 100ms in the gateway seemed to get rid of the 404s in a test environment. I'm not advocating this as a solution yet, just an observation.
This is something I was looking into last week with @aledbf, and I didn't remember that there was a 5s cache lifetime for function availability. Another way to look at this (which is separate from the original issue raised) is to ensure that the function's graceful shutdown time (to drain requests) is higher than the gateway's cache period.
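To illustrate that interaction, here is a toy availability cache with a fixed lifetime: an entry cached just before a pod starts draining keeps answering until it expires, which is why the drain window should be longer than the cache lifetime. The names and the TTL handling are illustrative, not the gateway's actual implementation.

```go
package gateway

import (
	"sync"
	"time"
)

type entry struct {
	available bool
	cachedAt  time.Time
}

// AvailabilityCache remembers per-function availability for a fixed lifetime.
type AvailabilityCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string]entry
}

func NewAvailabilityCache(ttl time.Duration) *AvailabilityCache {
	return &AvailabilityCache{ttl: ttl, entries: map[string]entry{}}
}

// Get returns the cached value and whether it is still fresh. A stale or
// missing entry forces the caller to re-check the real source of truth.
func (c *AvailabilityCache) Get(fn string) (available bool, fresh bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.entries[fn]
	if !ok || time.Since(e.cachedAt) > c.ttl {
		return false, false
	}
	return e.available, true
}

// Set records the availability observed just now.
func (c *AvailabilityCache) Set(fn string, available bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[fn] = entry{available: available, cachedAt: time.Now()}
}
```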
I'm closing this issue given that I hear from @matzhaugen (the issue creator) that he's moved on from where he was using openfaas. If the new team land here, feel free to open a new issue, or talk to us about support https://openfaas.com/support/
/lock |
Expected Behaviour
When a new gateway pod is created in an HA environment, we expect the endpoint lister to be able to find all the function deployments without problems.
Current Behaviour
In an HA gateway deployment with 4 gateways, we sometimes see 1 of the 4 gateways consistently return 404s (Not Found) for some fraction of the calls in the faas-netes container. This is most likely to happen right after the new gateway pod is created. We see about 99% of our errors coming from this line and 1% coming from this line.
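For context, the resolution path being described works roughly like the sketch below: the proxy consults its local endpoints cache and answers 404 on a miss, whether or not the Deployment actually exists in the cluster. This is a simplified illustration, not the actual proxy.go; the query-parameter parsing in particular is a placeholder.

```go
package gateway

import (
	"fmt"
	"net/http"

	"k8s.io/apimachinery/pkg/api/errors"
	corelisters "k8s.io/client-go/listers/core/v1"
)

// makeProxyHandler resolves the target function through the cached endpoints
// lister before forwarding the request.
func makeProxyHandler(lister corelisters.EndpointsLister, namespace string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		functionName := r.URL.Query().Get("function") // placeholder for real path parsing

		ep, err := lister.Endpoints(namespace).Get(functionName)
		if errors.IsNotFound(err) {
			// A cache miss surfaces here as a 404, which is consistent with
			// the behaviour described in this issue.
			http.Error(w, fmt.Sprintf("no endpoints found for %s", functionName), http.StatusNotFound)
			return
		}
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		if len(ep.Subsets) == 0 {
			http.Error(w, "no ready addresses for "+functionName, http.StatusServiceUnavailable)
			return
		}

		// ...pick an address from ep.Subsets and reverse-proxy the request...
		w.WriteHeader(http.StatusOK)
	}
}
```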
We run with a setup where function pods are terminated within 10 seconds of a call completing, unless other connections from a gateway are opened.
A typical scenario looks as follows: a new gateway is created and we see that some fraction of calls return 404s.
In the above picture, new gateways were spawned right after 9:40am, when the 404s started to show up.
We have also set
gateway.directFunctions=false
and run all calls asynchronously, through the NATS queue and the queue-workers. So if my interpretation of the call path is correct it should be: Gateway -> faas-netes -> nats-queue -> queue-worker -> Function pod (correct?) We are seeing the 404 in the faas-netes container, so I suspect the NATS queue is innocent.
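For reference, the asynchronous invocation described above is, from a client's point of view, a POST to the gateway's /async-function/<name> route, which is accepted and later executed via the queue. In the sketch below the gateway URL and function name are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	body := bytes.NewBufferString(`{"example": true}`)

	// An async invocation is queued rather than executed inline.
	resp, err := http.Post("http://127.0.0.1:8080/async-function/env", "application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// A healthy submission returns 202 Accepted; the work itself is picked up
	// from the queue by the queue-worker afterwards.
	fmt.Println("status:", resp.Status)
}
```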
Possible Solution
We're not sure how to solve this. I suspect something is corrupting the cache of the k8s client, perhaps because function pods are going in and out of existence so fast. Then when we ask for the entry in the cache here, it won't be successful. But then I would expect all gateways to be equally affected, not just one.
Steps to Reproduce (for bugs)
Unfortunately, this bug is difficult to reproduce locally due to the complex call pattern that we have. One might be able to reproduce it by creating a scenario where pods are being called at intervals of less than 1s.
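A rough sketch of that reproduction idea: call a single function at sub-second intervals while scaling the gateway deployment up and down (for example with kubectl), and count how many calls come back as 404. The gateway URL, function name, rate and duration below are placeholders.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var total, notFound int64

	ticker := time.NewTicker(200 * time.Millisecond) // roughly 5 calls per second
	defer ticker.Stop()
	deadline := time.After(5 * time.Minute)

	for {
		select {
		case <-deadline:
			fmt.Printf("done: %d calls, %d returned 404\n", atomic.LoadInt64(&total), atomic.LoadInt64(&notFound))
			return
		case <-ticker.C:
			go func() {
				resp, err := http.Get("http://127.0.0.1:8080/function/env")
				if err != nil {
					return // network errors are ignored in this sketch
				}
				defer resp.Body.Close()
				atomic.AddInt64(&total, 1)
				if resp.StatusCode == http.StatusNotFound {
					atomic.AddInt64(&notFound, 1)
				}
			}()
		}
	}
}
```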
Context
This issue is vital for us to solve due to the unexpected nature of the 404s.
Your Environment
FaaS-CLI version (full output from faas-cli version): NA
Docker version: containerd://1.3.2g
Kubernetes version: v1.17.14-gke.400
Operating System and version (e.g. Linux, Windows, MacOS): GKE, Linux Container-Optimized OS from Google
Link to your project or a code example to reproduce issue: NA