Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic when max_span_count is reached, add counter metric #104

Merged
merged 5 commits into from
Jan 25, 2022
Merged

Fix panic when max_span_count is reached, add counter metric #104

merged 5 commits into from
Jan 25, 2022

Conversation

nickbp
Copy link
Contributor

@nickbp nickbp commented Jan 19, 2022

Panic seen in ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0 with log-level=debug:

panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226

Also adds metric counter and logging to surface when things are hitting backpressure.

Signed-off-by: Nick Parker [email protected]

Which problem is this PR solving?

  • Fixes panic when recently added max_span_count flush is being exercised

Short description of the changes

  • Updates case type so that the panic no longer occurs
  • Adds debug log and metric counter to allow monitoring of the setting, since reaching the limit may indicate backpressure/problems with writing spans to ClickHouse. (In my case I was seeing the panic when ClickHouse's disk was full)

@nickbp
Copy link
Contributor Author

nickbp commented Jan 20, 2022

After fixing the panic, I'm now seeing an issue where the worker.CLose() call blocks indefinitely and eventually jaeger-clickhouse stops logging any progress. I'm assuming that this is a deadlock somewhere in that CLose() logic, which eventually halts all progress. I guessed that it might be due to a double lock on the mutex and tried fixing that in 935ff78. But with that fix I'm still seeing the deadlock behavior, where the expected Existing batch finished message is never produced.

At this point I'm assuming the indefinite block on CLose() is due to an issue in the worker's internal sync/messaging handling. I'm thinking an easy way of resolving the situation would be to just discard new batches when the limit is being exceeded (while logging a warning/increasing a counter), rather than blocking on existing batches to finish. This then avoids the need for this special handling.

@nickbp
Copy link
Contributor Author

nickbp commented Jan 21, 2022

I simplified the max_span_count handling, and am no longer seeing the earlier stalled behavior when the max was reached:

  • When max_span_count is exceeded, switched to discarding data (and recording it in log+metric), rather than waiting potentially indefinitely(!) for the existing workers to finish their current work
  • Moved pending span counter into pool, removing the need for mutexing
  • Treated max_span_count<=0 configuration as unlimited

@nickbp nickbp mentioned this pull request Jan 24, 2022
@nickbp
Copy link
Contributor Author

nickbp commented Jan 25, 2022

I've been running this for a few days now in a test environment where clickhouse is intentionally still backed up, and it's been holding stable.

Nick Parker added 5 commits January 26, 2022 08:42
Panic seen in `ghcr.io/jaegertracing/jaeger-clickhouse:0.8.0` with `log-level=debug`:
```
panic: undefined type *clickhousespanstore.WriteWorker return from workerHeap

goroutine 20 [running]:
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).CleanWorkers(0xc00020c300, 0xc00008eefc)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:95 +0x199
github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*WriteWorkerPool).Work(0xc00020c300)
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/pool.go:50 +0x15e
created by github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore.(*SpanWriter).backgroundWriter
	github.com/jaegertracing/jaeger-clickhouse/storage/clickhousespanstore/writer.go:89 +0x226
```

Also adds metric counter and logging to surface when things are hitting backpressure.

Signed-off-by: Nick Parker <[email protected]>
The current limit logic can result in a stall where `worker.CLose()` never returns due to errors being returned from ClickHouse.
This switches to a simpler system of discarding new work when the limit is reached, ensuring that we don't get backed up indefinitely in the event of a long outage.

Also moves the count of pending spans to the parent pool:
- Avoids race conditions where new work can be started before it's added to the count
- Mutexing around the count is no longer needed

Signed-off-by: Nick Parker <[email protected]>
Signed-off-by: Nick Parker <[email protected]>
@nickbp nickbp merged commit 34ad236 into jaegertracing:main Jan 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant