
Ingester panics when no messages is processed in 1 minute #1125

Closed
marqc opened this issue Oct 17, 2018 · 3 comments
@marqc
Contributor

marqc commented Oct 17, 2018

Requirement - what kind of business use case are you trying to solve?

Forward traces from Kafka storage to Elasticsearch storage with jaeger-ingester deployed to a Kubernetes cluster. There are periods when the traced system is not in use and no traces are being recorded.

Problem - what in Jaeger blocks you from solving the requirement?

When the ingester does not process any message for one minute (time.Minute), the process dies with:

{"level":"panic","ts":1539766162.6593273,"caller":"consumer/deadlock_detector.go:69","msg":"No messages processed in the last check interval"

That stops the Docker container. Kubernetes brings it back, but with exponential backoff, so each restart makes the pause longer.
Without Kubernetes or a systemd-style manager that keeps restarting the ingester, the process would simply die and stay down.

Proposal - what do you suggest to solve the problem or improve the existing situation?

  • Trust the Kafka client's (Sarama's) built-in failure detection mechanism, possibly in combination with exposing Kafka consumer options (like read message timeouts) as configurable.
  • Restart the consumer within the running process (build a new consumer and re-bootstrap the app without exiting).
  • Expose configuration for the deadlock_detector tick duration; this does not solve the problem, but helps manage its impact on the business.

Any open questions to address

Which of the proposed solutions is best?

@yurishkuro
Member

This was done to address a number of issues with the sarama lib (#1052) until they are fixed upstream.

I think exposing the timeout setting via CLI/config is a good idea, and setting it to 0 should turn off the self-termination behavior.

@yurishkuro
Member

@marqc are you interested in creating a pull request to fix that?

Also, consider adding yourself to #207.

@marqc
Contributor Author

marqc commented Oct 19, 2018

@yurishkuro Yes, I will take a look next week.
