
Long running PHP PubSub process that eats all cpu exactly 1h into processing #2338

Closed
udAL opened this issue Sep 19, 2019 · 82 comments
Labels
  • api: pubsub (Issues related to the Pub/Sub API.)
  • external (This issue is blocked on a bug with the actual product.)
  • priority: p2 (Moderately-important priority. Fix may not be included in next release.)
  • type: bug (Error or flaw in code with unintended results or allowing sub-optimal usage patterns.)


@udAL

udAL commented Sep 19, 2019

This is more of a plea for help than a bug report...

  • OS: Dockerized Ubuntu 16.04.6 (Modified gcr.io/google-appengine/php72)
  • PHP version: 7.2.22
  • Package name and version:
    google/cloud-pubsub 1.11.0
    google/cloud-core 1.27.0
    (outdated because of compatibility reasons)

We have been using PubSub for a while to manage background tasks. We have a GCE VM running supervisord with two long-running PHP processes that check for new messages and process whatever comes in. It looks something like this:

<?php
// Worker script (excerpt): pulls one message at a time from the 'workers' subscription.
use Google\Cloud\PubSub\PubSubClient;

require __DIR__ . '/vendor/autoload.php'; // Composer autoloader (path assumed)

$pubSub = new PubSubClient([
    'keyFilePath' => 'includes/secret.json'
]);
$subscription = $pubSub->subscription('workers');

while (true) {
    $message = false;
    try {
        // Pull at most one message per iteration.
        foreach ($subscription->pull(['maxMessages' => 1]) as $pullMessage) {
            $message = $pullMessage;
        }
    } catch (Exception $e) {
        // Nothing: swallow pull errors and retry on the next iteration.
    }

    if ($message) {
        // Do stuff
        $subscription->acknowledge($message);
    }
}

As discussed in #1986, since we started we have seen these errors popping up every hour:

[19-Sep-2019 16:30:30 Europe/Madrid] PHP Fatal error:  Uncaught Google\Cloud\Core\Exception\ServiceException: cURL error 35: gnutls_handshake()$
Stack trace:
#0 /app/vendor/google/cloud-core/src/RequestWrapper.php(189): Google\Cloud\Core\RequestWrapper->convertToGoogleException(Object(GuzzleHttp\Exce$
#1 /app/vendor/google/cloud-core/src/RestTrait.php(95): Google\Cloud\Core\RequestWrapper->send(Object(GuzzleHttp\Psr7\Request), Array)
#2 /app/vendor/google/cloud-pubsub/src/Connection/Rest.php(193): Google\Cloud\PubSub\Connection\Rest->send('subscriptions', 'pull', Array)
#3 /app/vendor/google/cloud-pubsub/src/Subscription.php(409): Google\Cloud\PubSub\Connection\Rest->pull(Array)
#4 /app/worker.php(77): Google\Cloud\PubSub\Subscription->pull(Array)
#5 {main}
  thrown in /app/vendor/google/cloud-core/src/RequestWrapper.php on line 336

It wasn't a big deal for us, since supervisord kept the processes running. That changed a few days ago when we updated our Docker image (which we do frequently). Suddenly our worker VMs were constantly at 100% CPU, even though there was nothing to process on PubSub and no tasks were running. We could restart the processes and they would work fine for exactly one hour, then they would jump to 100% CPU for no apparent reason. We're stuck with an essential server that needs restarting every 60 minutes.

We have done a lot of debugging and the only difference we found between the old and new machines is that the Ubuntu package libcurl3-gnutls was updated from 7.47.0-1ubuntu2.12 to 7.47.0-1ubuntu2.14. We haven't found a way to downgrade.

We have tried to force Guzzle to use specific TLS versions. PubSub allows passing cURL parameters through to it like this:

$subscription->pull([
    'maxMessages' => 1,
    'restOptions' => [
        'curl' => [
            CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_0
        ]
    ]
])

But it hasn't worked, not with CURL_SSLVERSION_TLSv1, CURL_SSLVERSION_TLSv1_0, or CURL_SSLVERSION_TLSv1_2.
I know this is a complex error, that it probably has nothing to do with google/cloud-pubsub, and that we're on an outdated version. Maybe it's a TLS incompatibility with the Google PubSub service. Maybe an outdated library in an updated OS. I really don't know, but we'll take any suggestion you can give us. Any ideas?
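
For completeness, here is the other variant we may try: setting the cURL options once when constructing the client instead of on every pull. This is only an untested sketch and assumes the 'restOptions' config is forwarded to the underlying google/cloud-core RequestWrapper:

use Google\Cloud\PubSub\PubSubClient;

// Untested sketch: apply the cURL options to every REST request made by the
// client, assuming 'restOptions' is passed through to the RequestWrapper.
$pubSub = new PubSubClient([
    'keyFilePath' => 'includes/secret.json',
    'restOptions' => [
        'curl' => [
            CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_2
        ]
    ]
]);
$subscription = $pubSub->subscription('workers');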

Tomorrow I'll try with a clean project with updated packages and will post the results.

@dwsupplee dwsupplee added api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue. labels Sep 19, 2019
@dwsupplee
Contributor

@udAL that sounds really frustrating, 🙁. We'll definitely take a look to see where/how we can help. Do you have the protobuf/gRPC extensions enabled? If so, would you be able to share versions of each?

As an aside, would you be able to share the compatibility issues you're experiencing? Maybe there's something we can do there to help as well.

@udAL
Author

udAL commented Sep 20, 2019

Hi @dwsupplee
I understand you're referring to google/protobuf and grpc/grpc composer packages. We did have them installed:

  • grpc/grpc 1.19.0
  • google/protobuf v3.7.1

Our compatibility issues come mainly from the requirements of the kreait/firebase-php package. If I uninstall it and update, I can get:

  • google/cloud-core v1.33.0 (latest)
  • google/cloud-pubsub v1.17.0 (latest)
  • google/protobuf v3.9.1 (latest)
  • grpc/grpc 1.22.0 (latest)

I'll do some tests to see if this was part of the problem.
Thanks

@udAL
Author

udAL commented Sep 20, 2019

With the latest versions of the packages the result is the same: 100% CPU one hour into the script.
I'm sure this is related to libcurl and cURL error 35: gnutls_handshake(), since that error used to pop up at the one-hour mark and has now been replaced by these 100% CPU spikes at the same point, so I'll dig into that. Following the cURL error 35 I found suggestions to force cURL to use TLS 1.0; do you know if my code is correct when passing parameters to Guzzle? Any ideas on what else I could try?

@udAL
Author

udAL commented Sep 20, 2019

To the best of my knowledge the connection to PubSub freezes one hour in. I don't know how or why, but since PubSub is usually used in long-running processes, isn't that something that should be taken into account?

To hotfix the problem we had installed cron and restarted the process every 50 minutes, but that has its own problems, since a task could be caught in the middle and, not being acknowledged, would be reprocessed. I have now put a timer in the script that destroys and recreates the PubSub object so it reconnects, and this should hotfix the problem for now.
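
Roughly, that workaround looks like the sketch below. It's only an outline of what we're running: the makeSubscription() helper and the 50-minute threshold are placeholders of my own, and it assumes that recreating the client is enough to force a fresh connection.

use Google\Cloud\PubSub\PubSubClient;

// Hypothetical helper: build a fresh client and subscription handle.
function makeSubscription()
{
    $pubSub = new PubSubClient([
        'keyFilePath' => 'includes/secret.json'
    ]);
    return $pubSub->subscription('workers');
}

$subscription = makeSubscription();
$connectedAt = time();

while (true) {
    // Recreate the client before the one-hour mark to avoid the stale connection.
    if (time() - $connectedAt > 50 * 60) {
        $subscription = makeSubscription();
        $connectedAt = time();
    }

    $message = false;
    try {
        foreach ($subscription->pull(['maxMessages' => 1]) as $pullMessage) {
            $message = $pullMessage;
        }
    } catch (Exception $e) {
        // Ignore pull errors and retry on the next iteration.
    }

    if ($message) {
        // Do stuff
        $subscription->acknowledge($message);
    }
}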

@udAL
Author

udAL commented Sep 20, 2019

This occurred also on google-cloud-python around Dec 2017:
googleapis/google-cloud-python#4600
It was fixed by requiring python package grpcio >= 1.8.2:
googleapis/google-cloud-python#4642

Does that help?

@udAL
Author

udAL commented Sep 26, 2019

Is there anything being done to solve this?
Otherwise, how should I proceed?
Can you help?

@illambo

illambo commented Sep 26, 2019

Regarding cURL error 35: gnutls_handshake(), we experience a similar issue in one app hosted on the App Engine flexible environment for PHP (see screenshot), and we use supervisord to keep the worker stable.

Another thing: for this application we have different environments, each in a different project. Currently we see this problem in only one of the three projects, so maybe try redeploying to a new project; maybe the problem is tied to a specific network zone.

(Screenshot: 2019-09-26 at 15:06:27)

  • google/cloud-pubsub: v1.17.0
  • google/cloud-core: v1.33.0
  • google/protobuf: v3.9.1
  • grpc/grpc: 1.22.0

@dwsupplee
Contributor

I understand you're referring to google/protobuf and grpc/grpc composer packages.

Actually, I was wondering about the extensions available on PECL (protobuf/grpc) :).

I have some time set aside next week to help look into this, my apologies for the delay. We've had a team summit this week.

@udAL
Author

udAL commented Sep 30, 2019

I did not have the protobuf or grpc extensions installed.
I have added them and it seems this solves the issue. I'll keep testing to be sure.

  • ext-protobuf: 3.7.0
  • ext-grpc: 1.19.0

Shouldn't these extensions be required by google/cloud-core?

EDIT: False 'all is good' alarm. It still pops up to 100% CPU

@dwsupplee
Contributor

Shouldn't these extensions be required by google/cloud-core?

Not all of our libraries that depend on core require the extensions, and oftentimes users may find themselves on something like shared hosting where the extensions aren't even an option. As a result we've worked to make sure they aren't hard requirements.

False 'all is good' alarm. It still pops up to 100% CPU

Gah, sorry to hear that. I'm setting up an environment to do some testing here soon.

@udAL
Author

udAL commented Oct 7, 2019

Any news?

@dwsupplee
Contributor

@udAL I've been working on reproducing. Would you be able to share the modified dockerfile you're using?

@jdpedrie
Contributor

@udAL are you able to share this additional info? :)

@udAL
Author

udAL commented Nov 4, 2019

Hi, sorry for the delay.
For some reason the package libcurl3-gnutls fell back on its own to 7.47.0-1ubuntu2.12 (instead of 7.47.0-1ubuntu2.14), so right now we don't have the 100% CPU problem and we are back to the good old cURL error 35: gnutls_handshake().

Here you can find a project that reproduces the error: udAL/gcp-pubsub-worker-test.

It requires:

  • PubSub json key file at includes/pubsub_secret.json
  • PubSub topic with pull subscription named workers_test (in my case with no expiration and 7 days message retention)

To deploy:

  • Build and push to Container registry
  • Deploy a GCE instance with this image

To reproduce:

  • SSH into machine
  • docker container ls (copy the first container id)
  • docker exec -it [ID] bash
  • Wait 1h
  • Take a look at app/worker.err.log

You don't need to send any message to workers_test.
Be aware that the process will automatically restart because of supervisord.

Thanks

@udAL
Author

udAL commented Nov 18, 2019

Any progress?

@udAL
Author

udAL commented Jan 3, 2020

@jdpedrie Could you reproduce the error?

@udAL
Author

udAL commented Jan 20, 2020

Hello there!
It's me again!
Is any action going to be taken? Should we stop waiting and look at other options? Anything?

@jdpedrie
Contributor

Hi @udAL,

I'm really sorry for the lengthy delay. I'm looking into this now.

@jdpedrie
Contributor

jdpedrie commented Jan 22, 2020

How many messages, on average, does the worker generally process before the CPU usage spikes? I've been running your sample application for quite some time (nearly 90 minutes) without seeing any sign of unusually high CPU usage. :-/

Have you tried using a specific tag of the gcr.io/google-appengine/php72 image, rather than latest? Perhaps pinning to a version released before the issue appeared may help, or at least give us a hint as to the source of the problem.

@udAL
Author

udAL commented Jan 22, 2020

The error I'm getting now is the cURL error 35: gnutls_handshake(), not the high CPU usage, as stated here:

For some reason the package libcurl3-gnutls fell back on its own to 7.47.0-1ubuntu2.12 (instead of 7.47.0-1ubuntu2.14), so right now we don't have the 100% CPU problem and we are back to the good old cURL error 35: gnutls_handshake().

It pops up exactly at 60 minutes. Have you seen the error in app/worker.err.log?

It doesn't seem to depend on how many messages it processes. In tests the error showed up without processing any messages.

Have you tried using a specific tag of the gcr.io/google-appengine/php72 image, rather than latest? Perhaps pinning to a version released before the issue appeared may help, or at least give us a hint as to the source of the problem.

I could, but which version should I use? I don't know of any that has worked.
It looks to me like there's a deep bug embedded in whatever library you're using to send requests, something that's sensitive to cURL versions. I could obviously try each cURL version to see if one works, but isn't that a lame workaround? Should I be stuck with an old cURL version forever?

Online I've found other cURL error 35: gnutls_handshake() reports with various google/cloud-core libraries. I can't pinpoint what it is, since I don't know the inner workings of the google-cloud-php library; that's why I come to you in search of help. All the 'fixes' I have found require tinkering with cURL parameters on the request, something that's not available to us from outside google/cloud-core.

@jdpedrie
Contributor

Have you tried running the container outside of GCE? Does the error still happen? If you haven't tried, please do and confirm. If it works outside GCE, I'll see about having a member of the network team investigate.

@udAL
Author

udAL commented Jan 27, 2020

I've just tested both locally and on GCE.
Confirmed: it works outside GCE and still fails inside GCE.
Good call! I'll wait for updates.

@jdpedrie
Contributor

Thank you for checking. Let me pass this information along to our team and we'll try to see what the next steps are. Thank you again for your patience, I absolutely know how frustrating this kind of thing can be.

@jdpedrie
Contributor

Looks like #2562 may be related.

@kamalaboulhosn

@udAL would you be able to reach out to Google Cloud Support and provide your project ID and the name of your GCE VM that was exhibiting the issue? You can reference this GitHub issue and that the Pub/Sub team asked you to provide this information for debugging. Thanks!

@udAL
Author

udAL commented Jan 29, 2020

@kamalaboulhosn Absolutely! Sorry if this is obvious, but where should I contact them? I can't find an email address or a form to use.

@meredithslota

Hi @dpassola — my apologies, I neglected to notice that this issue was still assigned to @jdpedrie who is no longer working on this project as of July (apologies for the delay here). I will find a new owner to help investigate.

@dpassola

dpassola commented Oct 6, 2021

Hi @meredithslota, I hope this issue will be fixed soon... I don't understand how an issue can stay open for more than 2 years...

Please, we need this issue fixed as soon as possible.

Thanks for your reply.

@saranshdhingra saranshdhingra self-assigned this Oct 7, 2021
@saranshdhingra
Contributor

saranshdhingra commented Oct 7, 2021

Hi @dpassola, let me apologise for this issue staying unresolved for such a long time.
I have cloned the repo mentioned (udAL/gcp-pubsub-worker-test) and I am trying to replicate the issue locally (I know it doesn't break locally, but just as a sanity check) and on a GCE instance simultaneously.
I will report my findings soon.

@dpassola

dpassola commented Oct 7, 2021

Hi @saranshdhingra, thanks for your reply.
I will be waiting impatiently for your news :)

@saranshdhingra
Contributor

Hi @dpassola, I was able to replicate the issue on a GCE instance.
The instance type was g1-small; however, I see in the thread above that you use n1-standard-1.
This suggests that the instance type isn't a factor, but to confirm it I may create a small Terraform script that recreates different instance types with this image, so that we can say for sure whether the instance type matters here.
I will keep you updated every step of the way.

@dpassola

dpassola commented Oct 8, 2021

Hi @saranshdhingra , thank you for your feedback and keep me updated!

@saranshdhingra
Contributor

Just to confirm whether the issue persists on different machine types, I tested the same setup on the following machine types:
f1-micro, g1-small, n1-standard-1, n1-standard-2, n1-standard-4, n1-standard-8, n1-highmem-2, n1-highcpu-2, e2-micro, e2-small, e2-medium, e2-standard-2, e2-standard-4, e2-highmem-2, e2-highcpu-2, n2-standard-2, n2-standard-4, n2-highmem-2, n2-highcpu-2
and the issue was replicated on all of them.

For anyone interested, I used this Terraform script to generate multiple GCE instances.

@saranshdhingra
Contributor

Hey @dpassola, just wanted to update you (since it's been a week): we're discussing this internally right now and attempting to zero in on the source.
I will keep you updated :)

@dpassola

dpassola commented Oct 19, 2021 via email

@illambo

illambo commented Nov 16, 2021

Santa clou(d)s please help us...

@dpassola

Hi @saranshdhingra , any news?

@udAL
Author

udAL commented Nov 16, 2021

Santa clou(d)s please help us...

I agree it's time to light a huge fire, join hands in a circle around it, look at the skies and pray. When everything else fails...

@saranshdhingra
Contributor

Hi @dpassola
Unfortunately, due to the nature of the issue, zeroing in on the exact cause is taking much more time than anticipated.

So far, I have tried the same setup in Node on different machines and it doesn't seem to replicate.
I have tried the same setup in PHP on GKE and it does seem to replicate.
For now the primary candidates for inspection are networking (in GCP) and gRPC (in PHP).
Once I have eliminated one of those, I will update here and focus on the other.

@saranshdhingra
Contributor

saranshdhingra commented Nov 24, 2021

Hi @dpassola
I have been checking different non-GCE environments, both internal and external. I have been testing with other languages (Node, Ruby, etc.) and running tcpdump to monitor the moment of the exception.

I did come across something today, so I wanted to ask a few questions about the versions of the libraries installed.

In an old message in this thread, it was mentioned:

Our compatibility issues come mainly from requirements of package kreait/firebase-php. If I uninstall it and update I can get:
google/cloud-core v1.33.0 (latest)
google/cloud-pubsub v1.17.0 (latest)
google/protobuf v3.9.1 (latest)
grpc/grpc 1.22.0 (latest)

Can you tell me the current versions that are running?

The reason I ask is that, using the Docker image from the project specified by @udAL, we are able to replicate the issue on a VM instance where the installed version of google/cloud-pubsub is 1.17. But when I run composer update inside the container, google/cloud-pubsub upgrades to 1.34.1 and the issue hasn't been replicated in over a day, i.e. the same process has been running and no error has been logged.

I tried a manual composer update and then ran it as a build step in the Dockerfile just before this line.

You can use the composer show command to find the version of the library installed.

@dpassola

Hi @saranshdhingra, here are our versions as shown by composer show (there are more libraries installed):

  • google/cloud-core v1.27.0
  • google/cloud-pubsub v1.11.0
  • google/protobuf v3.7.1
  • grpc/grpc 1.19.0

Thanks!

@saranshdhingra
Contributor

@dpassola Are you using the same Dockerfile and composer.json (or at least similar ones) as mentioned here? I have been using that as a baseline for testing everywhere.

If you are, then a composer update should produce the same versions for the same Dockerfile/composer.json pair.

@dpassola

@saranshdhingra, yes, our composer.json and Dockerfile are the same as the ones mentioned.
We can run composer update in a test project... but not in production without checking that there aren't any issues from updating the libraries.

@saranshdhingra
Contributor

@dpassola Please do so in a test project.
Also, just to be thorough, make sure the versions in composer.json are not hard-coded strings but use the caret operator.
Example: google/cloud-pubsub: "^1.11" and not google/cloud-pubsub: "1.11"

@dpassola

Hi @saranshdhingra, if we put composer update after the SUPERVISORD line, this is what it does (we also removed the caret in composer.json):

  • Updating grpc/grpc (1.22.0 => 1.42.0): Downloading (100%)
  • Updating google/protobuf (v3.9.1 => v3.19.1): Downloading (100%)
  • Updating firebase/php-jwt (v5.0.0 => v5.5.1): Downloading (100%)
  • Installing psr/http-factory (1.0.1): Downloading (100%)
  • Updating guzzlehttp/psr7 (1.6.1 => 2.1.0): Downloading (100%)
  • Installing symfony/deprecation-contracts (v2.5.0): Downloading (100%)
  • Installing psr/http-client (1.0.1): Downloading (100%)
  • Updating guzzlehttp/guzzle (6.3.3 => 7.4.0): Downloading (100%)
  • Updating google/auth (v1.5.2 => v1.18.0): Downloading (100%)
  • Updating google/common-protos (1.1 => 1.4.0): Downloading (100%)
  • Updating google/grpc-gcp (0.1.4 => v0.2.0): Downloading (100%)
  • Updating google/gax (1.1.2 => v1.10.0): Downloading (100%)
  • Updating rize/uri-template (0.3.2 => 0.3.4): Downloading (100%)
  • Updating psr/log (1.1.0 => 1.1.4): Downloading (100%)
  • Updating monolog/monolog (2.0.0 => 2.3.5): Downloading (100%)
  • Updating google/cloud-core (v1.33.0 => v1.43.1): Downloading (100%)
  • Downgrading google/cloud-pubsub (v1.17.0 => v1.11.0): Downloading (100%)
  • Downgrading google/cloud-storage (v1.14.0 => v1.11.0): Downloading (100%)
  • Downgrading google/cloud-logging (v1.18.0 => v1.16.0): Downloading (100%)
  • Downgrading google/cloud-error-reporting (v0.14.5 => v0.14.0): Downloading (100%)

We deployed this code and are waiting to test this new version.

@saranshdhingra
Contributor

saranshdhingra commented Nov 24, 2021

Hi @dpassola
No, we need to leave the caret in and not remove it.

Keep the caret and then try to run composer update.

As you can see, it's currently downgrading the pubsub library from 1.17 to 1.11, while I expect to see an upgrade.

@dpassola

Hi @saranshdhingra, here are the library versions with the caret :)
The caret is how we have composer.json in production.

Sorry for my mistake.

List of libraries:

  • Updating grpc/grpc (1.22.0 => 1.42.0): Downloading (100%)
  • Updating google/protobuf (v3.9.1 => v3.19.1): Downloading (100%)
  • Updating firebase/php-jwt (v5.0.0 => v5.5.1): Downloading (100%)
  • Installing psr/http-factory (1.0.1): Downloading (100%)
  • Updating guzzlehttp/psr7 (1.6.1 => 2.1.0): Downloading (100%)
  • Installing symfony/deprecation-contracts (v2.5.0): Downloading (100%)
  • Installing psr/http-client (1.0.1): Downloading (100%)
  • Updating guzzlehttp/guzzle (6.3.3 => 7.4.0): Downloading (100%)
  • Updating google/auth (v1.5.2 => v1.18.0): Downloading (100%)
  • Updating google/common-protos (1.1 => 1.4.0): Downloading (100%)
  • Updating google/grpc-gcp (0.1.4 => v0.2.0): Downloading (100%)
  • Updating google/gax (1.1.2 => v1.10.0): Downloading (100%)
  • Updating psr/log (1.1.0 => 1.1.4): Downloading (100%)
  • Updating monolog/monolog (2.0.0 => 2.3.5): Downloading (100%)
  • Updating rize/uri-template (0.3.2 => 0.3.4): Downloading (100%)
  • Updating google/cloud-core (v1.33.0 => v1.43.1): Downloading (100%)
  • Updating google/cloud-pubsub (v1.17.0 => v1.34.1): Downloading (100%)
  • Updating google/cloud-logging (v1.18.0 => v1.22.2): Downloading (100%)
  • Updating google/cloud-storage (v1.14.0 => v1.25.2): Downloading (100%)

composer.json content (in grey, the versions our IDE detects will be installed): [screenshot]

@dpassola

Hi @saranshdhingra, our composer.json is as I showed yesterday, but we don't have composer update in our Dockerfile... we are testing the new configuration.
Thanks

@saranshdhingra
Contributor

Hi @dpassola
Hope you had a great Thanksgiving week :)

Were you able to test the process with the new version of libraries?

@dpassola

dpassola commented Dec 1, 2021

Hi @saranshdhingra, thank you, and sorry for the delay in replying.

We are testing the process. At the moment it is working fine!

In the next few days I'll send you my feedback again.

Thanks!

@saranshdhingra
Contributor

Hi @dpassola
Great to know it's working fine for now.

I'll be closing this issue for now and will try to dig more into the specific version that fixes it and the reason why.

If you find that the issue is still not fixed, feel free to resume the thread.

Again, apologies for such a delay in the resolution, but I can only hope it works from this point forward :)

@saranshdhingra
Contributor

Upon more digging, I found some info which may be useful for folks who may be hitting this issue on old versions.
The version of cloud-pubsub didn't matter in reproducing the issue.
I think it's the version of google/gax which is responsible.
I found that the issue doesn't exist starting with "google/gax": "1.5.0".
I even checked the dependencies (using composer show) in v1.4.0 and v1.5.0, and the same versions of the dependencies were loaded.

@dpassola

Hello, it has been in production for 17 days and, for now, everything is correct and we have not had to restart it. This is great news! Have a great year! Thank you @saranshdhingra !
