Pub / Sub Subscriber: CPU Usage eventually spikes to 100% #4600

Closed
dhermes opened this issue Dec 15, 2017 · 29 comments
Labels
api: pubsub Issues related to the Pub/Sub API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. release blocking Required feature/issue must be fixed prior to next release. triaged for GA type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@dhermes
Contributor

dhermes commented Dec 15, 2017

This report is distilled from many other issues. Many thanks to @anorth2 and @dmontag for reporting and helping refine the issue.

Issues that have been partially resolved (except for the CPU spike) are being collapsed into this one:


Core issue: there is a spinlock bug in gRPC (present in all recent versions 1.6.x, 1.7.x, 1.8.1). A "bandaid" fix exists but has not been merged (as of noon Pacific on December 15, 2017).

Update: The fix was rolled back, but will be rolled forward (grpc/grpc#13918).

Potential workaround: compile grpcio from source with the bandaid fix included, or just use the 64-bit manylinux wheel that I already created. (I may be open to creating Mac OS X wheels; not sure about Windows.)

@dhermes dhermes added api: pubsub Issues related to the Pub/Sub API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Dec 15, 2017
@dhermes
Contributor Author

dhermes commented Dec 15, 2017

@anorth2 Would you mind trying with the patched grpcio==1.7.4.dev1? (This is a follow-up to #4274 (comment))

@anorth2

anorth2 commented Dec 15, 2017

Testing it out now, I'll edit this comment once I have it deployed.

@dhermes Having issues with the wheel, getting this: grpcio-1.7.4.dev1-cp36-cp36m-manylinux1_x86_64.whl is not a supported wheel on this platform. on a python:3-alpine base image. Any suggestions? (We don't do a lot of whl compiling here)

@dhermes
Contributor Author

dhermes commented Dec 15, 2017

@anorth2

  1. What command are you running to get that error? (I assume you've downloaded the wheel and are running pip install grpcio....whl)
  2. Feel free to ping me on Hangouts
  3. UPDATE: It looks like manylinux wheels are not compatible with alpine
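The "not a supported wheel on this platform" error comes down to the platform tag in the wheel filename: manylinux1 wheels assume glibc, while Alpine images use musl. A minimal sketch of reading those tags from a filename, assuming the standard wheel naming scheme (the helper names `wheel_tags` and `is_manylinux` are made up for illustration and are not part of pip):

```python
import platform


def wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename.

    Assumes the standard layout:
    name-version[-build]-python-abi-platform.whl
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel filename: %r" % (filename,))
    parts = filename[: -len(".whl")].split("-")
    if len(parts) < 5:
        raise ValueError("unexpected wheel filename: %r" % (filename,))
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def is_manylinux(filename):
    """True if the wheel's platform tag is a manylinux tag."""
    return wheel_tags(filename)[2].startswith("manylinux")


def looks_like_glibc():
    """Best-effort check that the running interpreter is linked against glibc."""
    return platform.libc_ver()[0] == "glibc"
```

On an image where `looks_like_glibc()` is false, a wheel for which `is_manylinux()` is true should be expected to be rejected, and installing from source is the likely fallback.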

@dhermes
Contributor Author

dhermes commented Dec 20, 2017

Fixed by #4642.

@dhermes dhermes closed this as completed Dec 20, 2017
@arindamchoudhury

arindamchoudhury commented Dec 28, 2017

Hi @dhermes ,

I am still having a problem with high CPU usage. After around one hour, CPU usage is at 100%.

google-api-core (0.1.3)
google-auth (1.2.1)
google-cloud-pubsub (0.30.1)
googleapis-common-protos (1.5.3)
grpc-google-iam-v1 (0.11.4)
grpcio (1.8.2)

I am using
Python 3.6.4 (default, Dec 21 2017, 01:35:12)
[GCC 4.9.2] on linux

@dhermes
Contributor Author

dhermes commented Jan 2, 2018

@arindamchoudhury Can you confirm that your Python shell is in the same environment where those packages are installed? Also, could you share some code so I might reproduce the CPU spike?
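One way to confirm that the interpreter and the package list belong to the same environment is to ask the running interpreter itself. This is a rough sketch, assuming Python 3.8 or newer for `importlib.metadata` (the Python 3.6 used in this thread would need a different mechanism); the helper name `installed_versions` is made up for illustration:

```python
import sys
from importlib import metadata  # available from Python 3.8


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def describe_environment(names):
    """Return the interpreter path together with the versions it can see."""
    return {"executable": sys.executable, "versions": installed_versions(names)}
```

Calling `describe_environment(["grpcio", "google-cloud-pubsub"])` from the same shell that runs the subscriber shows which interpreter is in use and which versions it actually sees.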

@dhermes
Contributor Author

dhermes commented Jan 2, 2018

I have just run this with google-cloud-pubsub==0.30.1 and grpcio==1.8.2 and I see CPU spikes at 100%. Will investigate soon.
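For anyone trying to reproduce the spike without watching top, the process can sample its own CPU usage. This is only a sketch and not how the measurements in this thread were taken; the function name `sample_cpu_fraction` is made up for illustration. It compares CPU time consumed by the whole process (all threads, including background ones) against wall-clock time over an interval:

```python
import time


def sample_cpu_fraction(interval=1.0):
    """Return process CPU time divided by wall time over `interval` seconds.

    A value near 1.0 means roughly one core was busy for the whole interval,
    even though the calling thread only slept. Values above 1.0 are possible
    when several threads are busy.
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    time.sleep(interval)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used
```

Logging this value periodically from the main thread while a subscriber runs in the background would show the point at which the process goes from mostly idle to pegging a core.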

@dhermes dhermes reopened this Jan 2, 2018
@arindamchoudhury

arindamchoudhury commented Jan 2, 2018

Hi @dhermes ,

If I install grpcio using:

pip install grpcio --ignore-installed --no-binary grpcio

it works fine.

Publishing also causes high CPU usage.

@dhermes
Contributor Author

dhermes commented Jan 2, 2018

The --no-binary remark is very good to know! Thanks.

dhermes added a commit to dhermes/google-cloud-pubsub-performance that referenced this issue Jan 3, 2018
As suggested [1], this actually stopped the spinlock bug.

[1]: googleapis/google-cloud-python#4600 (comment)
@dhermes
Contributor Author

dhermes commented Jan 3, 2018

I filed grpc/grpc#13906 to note the difference in behavior between the source dist and the binary wheels. (Thanks @arindamchoudhury!)

Also note that this issue must remain open since the spinlock fix has been reverted.

@mehrdada

mehrdada commented Jan 3, 2018

I posted this comment grpc/grpc#13906 (comment), but it might be worth posting here directly as well:

Our Linux binary wheels target the manylinux platform, which is what PyPI expects. For compatibility reasons, manylinux defines a very old kernel interface that does not support things like the epoll system calls. Therefore, when we build the binary wheels to upload to PyPI, the epoll polling strategy is not included in the build and the default is the simpler and more portable poll. The code path that seems to trigger the spinlock bug is in gRPC's poll polling strategy. When you build from source locally on a modern Ubuntu machine, our more advanced polling strategies (i.e. epoll) get compiled in because your platform is known to support them, and they are used by default because they are supposed to be better. This also seems to have the side effect of working around the spinlock bug.

You can try forcing the GRPC_POLL_STRATEGY=poll environment variable on your built-from-source version and see if the problem manifests, to confirm this theory. Additionally, it implies that building from source on macOS or Windows to sidestep this issue is going to be a futile exercise: the issue likely does not exist on Windows in the first place, and on macOS poll is the only option anyway, because gRPC Core does not yet support the kqueue interface provided by Darwin and the XNU kernel does not support the Linux-style epoll system call.
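The experiment suggested above can also be set up from inside Python rather than from the shell. A minimal sketch, assuming gRPC Core reads GRPC_POLL_STRATEGY when it initializes, so the variable has to be set before grpc is first imported (the helper names are made up for illustration):

```python
import os
import select


def force_poll_strategy(strategy="poll"):
    """Set GRPC_POLL_STRATEGY for this process and return the value set.

    Assumption: this must run before the first `import grpc`, since the
    variable is expected to be read once at gRPC initialization.
    """
    os.environ["GRPC_POLL_STRATEGY"] = strategy
    return os.environ["GRPC_POLL_STRATEGY"]


def platform_has_epoll():
    """True if this Python build exposes epoll (Linux), False otherwise."""
    return hasattr(select, "epoll")
```

If a built-from-source grpcio starts spinning after `force_poll_strategy("poll")`, that would support the theory that the bug lives in the poll code path.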

@dhermes
Contributor Author

dhermes commented Jan 3, 2018

@mehrdada Thanks for stopping by, I really appreciate it! The real issue this is tracking is what grpc/grpc#13665 tried to fix, but it's all valuable information.

@MaxDesiatov

MaxDesiatov commented Jan 8, 2018

I'm seeing constant 100% usage with grpcio==1.8.3 and google-cloud-pubsub==0.30.1, cpython 3.6.4, CoreOS from GKE

@dhermes
Contributor Author

dhermes commented Jan 8, 2018

@explicitcall Is that when installing grpcio from a wheel or from source (see why you really want to install from source)?

@MaxDesiatov

@dhermes just a plain pipenv install, from a wheel I assume

@MaxDesiatov

I'm also confused if I should still install from source when this issue is closed already grpc/grpc#13906

@dhermes
Contributor Author

dhermes commented Jan 9, 2018

I'm also confused if I should still install from source when this issue is closed already grpc/grpc#13906

Yes you should install from source. I'm not a pipenv user, so I'm not sure how to pass --ignore-installed and --no-binary=grpcio to pipenv install. You may also be happier just building a wheel once for your machine / image and then installing from that local wheel.
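The "build a wheel once, then install from that local wheel" approach could look roughly like the sketch below. The pip options used (`wheel`, `--no-binary`, `--wheel-dir`, `--no-index`, `--find-links`, `--ignore-installed`) are standard pip options, but the directory name `wheelhouse` and the helper names are assumptions made for illustration:

```python
import subprocess
import sys


def build_commands(wheel_dir="wheelhouse"):
    """Return (build, install) pip command lines for a locally built grpcio."""
    pip = [sys.executable, "-m", "pip"]
    # Compile grpcio from its source distribution and store the wheel locally.
    build = pip + ["wheel", "grpcio", "--no-binary", "grpcio",
                   "--wheel-dir", wheel_dir]
    # Install only from the local directory, never from the PyPI binary wheel.
    install = pip + ["install", "--ignore-installed", "--no-index",
                     "--find-links", wheel_dir, "grpcio"]
    return build, install


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

`run_all(build_commands())` would do both steps on one machine. In a Docker image, the build step would typically run once and the resulting directory be copied to wherever the install step runs.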

@MaxDesiatov

Thanks for the clarification. Is there an issue to track or a certain grpc release to wait for until this wheel/source issue is resolved after grpc/grpc#13906 is closed?

@dhermes
Contributor Author

dhermes commented Jan 9, 2018

/cc @mehrdada Can you weigh in?

@chemelnucfin chemelnucfin self-assigned this Jan 9, 2018
@zburgermeiszter

Is there a known old version that is not affected by this bug or alternatively a Dockerfile with a workaround?

@mehrdada

mehrdada commented Jan 17, 2018

@explicitcall @dhermes Please note that this is a limitation enforced by PyPI: we simply have no way to use epoll_create1 because the manylinux1 glibc does not support it. This is not easily fixable by a gRPC release. I am trying to use epoll_create followed by fcntl to enable some level of epoll support (PR #14041)--we will have to wait and see if it works well, because other limitations such as eventfd will remain. Note that even then the so-called "source/binary" difference will exist, because you would be building on two separate platforms, one with a much older glibc, so to get the "best" Python gRPC it's ideal to build from source.

That said, it is my understanding that the CPU issue was resolved (or at least a resolution was tried) in 1.8.4.
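To illustrate the "epoll_create followed by fcntl" idea mentioned above: epoll_create1(EPOLL_CLOEXEC) creates the descriptor and sets close-on-exec in one step, while the older interface needs two calls. The sketch below shows the two-call pattern from Python via ctypes. It is Linux only, it is not the gRPC Core implementation, and the function name is made up for illustration:

```python
import ctypes
import fcntl
import os


def epoll_create_cloexec():
    """Create an epoll fd with epoll_create, then set FD_CLOEXEC with fcntl."""
    libc = ctypes.CDLL(None, use_errno=True)
    fd = libc.epoll_create(1)  # the size hint is ignored but must be positive
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    # Second step: mark the descriptor close-on-exec.
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
    return fd
```

Unlike epoll_create1, the two steps are not atomic, so a fork and exec in another thread between the two calls could still inherit the descriptor.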

@dhermes
Contributor Author

dhermes commented Jan 17, 2018

@mehrdada The old glibc should be unrelated to the spinlock bug. If it's not, then that means some code paths still have a bug.

As for the fix that works with epoll, it was rolled back in grpc/grpc#13898 but then was rolled forward in grpc/grpc#13933. I'm not sure if that has been released for 1.8.4.

@jonparrott Is it worth discussing alternate hosting for Ubuntu wheels (and other Linux platforms)?

@theacodes
Contributor

@jonparrott Is it worth discussing alternate hosting for Ubuntu wheels (and other Linux platforms)?

That experience is really bad, but it's worth us discussing some alternatives here, as the sentence so to get the "best" Python gRPC it's ideal to build from source. is horrifying.

@mehrdada

@dhermes Agreed. To be clear, I was addressing the source/binary distinction. My understanding is that a bug fix was attempted in 1.8.4 (including poll code path). Are you still encountering the same bug on 1.8.4? If yes, please ping the relevant issue on the gRPC tracker again.

@mehrdada

mehrdada commented Jan 17, 2018

@jonparrott We are actively trying to improve the situation on the binary side (grpc/grpc#14041), but as far as PyPI packages are concerned, some of those limitations are imposed by PyPI. Perhaps the Python community should work on a manylinux2 specification?

@theacodes
Contributor

Moving this to an internal thread for now. :) Please check email?

@tseaver
Contributor

tseaver commented Jan 18, 2018

That experience is really bad, but it's worth us discussing some alternatives here, as the sentence so to get the "best" Python gRPC it's ideal to build from source. is horrifying.

Shared binaries are like the waltzing bear: it's not how well he waltzes, but that he waltzes at all. We shouldn't be surprised that the manylinux strategy isn't perfect, as that was a known problem with it at its adoption.

@danoscarmike danoscarmike added release blocking Required feature/issue must be fixed prior to next release. triaged for GA labels Jan 19, 2018
@theacodes
Contributor

According to @mehrdada and team, this appears to be resolved by grpcio 1.8.4. Closing, but if it reappears we can ofc re-open.

@mehrdada

Yes, please file an issue on the gRPC issue tracker (or reopen the existing one) with a new repro if you encounter this again. FYI, since it is relevant to the previous discussion on this thread: we modified the epoll1 IO manager code path in core to rely on epoll_create followed by fcntl when compiled on manylinux1 (instead of epoll_create1), so our binary packages should have epoll support enabled by default from 1.9.0rc1, which should help too.

Eric0329 pushed a commit to VenRaaS/Horae that referenced this issue Jan 26, 2018
due to CPU eventually spikes to 100%
see googleapis/google-cloud-python#4600
- apply client.projects().subscriptions().pull() to pull messages
- change message object to message dictionary (hmessage.py)
- updated message handling statements