Fix panic when max_span_count is reached, add counter metric #104
After fixing the panic, I'm now seeing an issue where the At this point I'm assuming the indefinite block on
I simplified the
I've been running this for a few days now in a test environment where ClickHouse is intentionally still backed up, and it's been holding stable.
Panic seen in `ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0` with `log-level=debug`:
```
panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226
```
Also adds metric counter and logging to surface when things are hitting backpressure.
Signed-off-by: Nick Parker <[email protected]>
The current limit logic can result in a stall where `worker.Close()` never returns due to errors being returned from ClickHouse. This switches to a simpler system of discarding new work when the limit is reached, ensuring that we don't get backed up indefinitely in the event of a long outage.
Also moves the count of pending spans to the parent pool:
- Avoids race conditions where new work can be started before it's added to the count
- Mutexing around the count is no longer needed
Signed-off-by: Nick Parker <[email protected]>
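For readers skimming the thread, here is a rough sketch of the "discard new work when the limit is reached" idea described in the commit message above. It is not the code from this PR: the `pool` type, its fields, and the `submit`/`done` methods are hypothetical names chosen for illustration, and the `dropped` field only stands in for the counter metric. The sketch assumes that a single goroutine owns the pool, which is why the pending count needs no mutex.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for the parent write worker pool.
// It is assumed to be used from one goroutine only, so the counters
// below are plain ints with no locking.
type pool struct {
	maxSpanCount int // limit on spans pending across all workers
	pending      int // spans handed to workers but not yet written
	dropped      int // spans discarded at the limit (stand-in for a counter metric)
}

// submit accepts a batch only if it fits under the limit. The pending
// count is updated before any worker would be started, so new work is
// never running without already being counted.
func (p *pool) submit(batch []string) bool {
	if p.pending+len(batch) > p.maxSpanCount {
		// Discard instead of blocking, so a long outage cannot back
		// the caller up indefinitely.
		p.dropped += len(batch)
		return false
	}
	p.pending += len(batch)
	// A real pool would hand the batch to a worker here.
	return true
}

// done is called by the pool when a worker reports that n spans were
// written (or given up on).
func (p *pool) done(n int) {
	p.pending -= n
}

func main() {
	p := &pool{maxSpanCount: 3}
	fmt.Println(p.submit([]string{"a", "b"})) // fits under the limit
	fmt.Println(p.submit([]string{"c", "d"})) // would exceed the limit, discarded
	p.done(2)                                 // first batch finished
	fmt.Println(p.submit([]string{"c", "d"})) // fits again
	fmt.Println("dropped spans:", p.dropped)
}
```

The trade-off in this kind of design is that spans are lost during an outage, but the writer never stalls waiting on a worker that cannot finish.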
Which problem is this PR solving?
Short description of the changes