Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gRPC plugin framework should be able to recover from panics #1742

Open
sergiimk opened this issue Aug 19, 2019 · 7 comments
Open

gRPC plugin framework should be able to recover from panics #1742

sergiimk opened this issue Aug 19, 2019 · 7 comments
Labels
area/storage help wanted Features that maintainers are willing to accept but do not have cycles to implement

Comments

@sergiimk
Copy link

Requirement - what kind of business use case are you trying to solve?

We are implementing a custom gRPC-based storage plugin as per this doc.

Problem - what in Jaeger blocks you from solving the requirement?

There are two related problems:

  • When gRPC plugin panics the reason for panic is not shown in Jaeger logs, making the cause very hard to identify (related to: gRPC storage plugins - how to debug? #1529)
  • When plugin panics it is not restarted, so Jaeger enters an unusable state

Impact:

  • Makes it very hard to make any plugin prod-ready
  • Makes developer experience of writing a plugin very frustrating
  • Plugin code becomes littered with defer/recover to prevent it from fully crashing and display helpful debug info

Proposal - what do you suggest to solve the problem or improve the existing situation?

Reason and a stacktrace should be already dumped into stderr of a plugin process at the time of panic, so Jaeger should be able to capture and log it.

Crashed plugins should be restarted.

Any open questions to address

@yurishkuro
Copy link
Member

Plugin framework is just https://github.com/hashicorp/go-plugin

I don't know if it supports recovery from plug-in crashes.

@yurishkuro yurishkuro added area/storage help wanted Features that maintainers are willing to accept but do not have cycles to implement labels Aug 20, 2019
@radekg
Copy link
Contributor

radekg commented Dec 11, 2019

I'd like to pick this one up. I have some experience with this.

@blackbaud-brandonstirnaman

Having some issues with a gRPC storage plugin I'm writing and really struggling to get good reason/stack data out.

@radekg or anyone else -- any suggestions on how to better get visibility?

@jpkrohling
Copy link
Contributor

I think @gouthamve had very similar issues.

@gouthamve: do you have a couple of minutes to share your experience?

@jpkrohling
Copy link
Contributor

jpkrohling commented Jun 10, 2020

Just saw #2071, which might help with this.

@chlunde
Copy link
Contributor

chlunde commented Oct 23, 2020

@radekg what solution were you thinking about?

I'm wondering if it would be best to simply crash/exit with an error code, so k8s/systemd would restart jaeger and normal alerting would detect the issue.

  • We would know that nothing is left behind after the crash, the state would be clean
  • We do not have to think about backoff, delay, metrics or anything like that

On the other hand, since the plugins are isolated, one could argue that we don't have to worry about the first case, and the second case might not be relevant because the process can't do anything useful without a working storage plugin any way.

@radekg
Copy link
Contributor

radekg commented Oct 26, 2020

Hi @chlunde. It's been a while since the last time I looked at this. The go-plugin implements a ping message in the protocol. Apparently that message can be used to find out if the plugin is running. That would have been the route I would have taken.

IMHO, restarting the complete collector seems to be a bit heavy handed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/storage help wanted Features that maintainers are willing to accept but do not have cycles to implement
Projects
None yet
Development

No branches or pull requests

6 participants