[Bug]: jaeger ingester module nil pointer error (when consume to ES) #3829
Comments
The repro script doesn't work for me (this is after I removed unnecessary copies of ES and Query). It's also missing Kafka. What is producing the actual traces in your setup? |
I have a program that produces spans and joins them under a trace ID based on our business logic, then sends the spans to jaeger-collector over HTTP. At first I used jaeger-all-in-one with memory storage, but the container restarted too often; obviously it can't be used in production. So I want to use an ES cluster to persist the data. |
I use an AWS Kafka cluster, so it's not in the docker-compose file. I just checked the code; can we do this?

func (fd FromDomain) convertSpanEmbedProcess(span *model.Span) *Span {
	s := fd.convertSpanInternal(span)
	if span.Process == nil {
		return nil
	}
	s.Process = fd.convertProcess(span.Process)
	s.References = fd.convertReferences(span)
	return &s
} |
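The maintainers mention below that a sanitizer already exists in the collector, and the fix in progress takes the opposite approach to the snippet above: instead of dropping a span whose Process is nil, it fills in a placeholder so downstream code never dereferences nil. A minimal sketch of that idea; the types and the placeholder service name here are illustrative stand-ins, not the real jaeger model package or sanitizer API:

```go
package main

import "fmt"

// Simplified stand-ins for the jaeger model types (illustrative only).
type Process struct{ ServiceName string }

type Span struct {
	OperationName string
	Process       *Process
}

// sanitizeEmptyProcess substitutes a placeholder Process rather than
// returning nil, so consumers can keep processing the span safely.
func sanitizeEmptyProcess(span *Span) *Span {
	if span.Process == nil {
		span.Process = &Process{ServiceName: "missing-service-name"}
	}
	return span
}

func main() {
	s := sanitizeEmptyProcess(&Span{OperationName: "op"})
	fmt.Println(s.Process.ServiceName) // prints: missing-service-name
}
```

The trade-off: dropping the span (returning nil) silently loses data, while the placeholder keeps the span visible and makes the corruption observable in storage.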
I am willing to help look into the issue. How can I start? |
We already have a fix being worked on in #3819.
Can you share this program's code? I am still not clear how the nil pointer gets into the data received by the ingester, because the same sanitizer being added in #3819 already exists in the collector. In other words, it should not be possible to have span data in Kafka with a nil Process. |
There are two programs. The first one produces JSON like this and sends it to Kafka:

{
  "header": {
    "k0": "v0",
    "k1": "v1"
  },
  "id": "512506985abb01b6",
  "parentId": "2a5b6961c59e731f",
  "name": "spanName_foo",
  "start": 1658247482737891800,
  "end": 1658247482737892900,
  "metadata": {
    "k0": "v0",
    "k1": "v1"
  }
}

The second one converts it to OTel-format spans and joins them based on business rules. Here's the sample JSON output; all of our production spans look like this: https://gist.github.com/huahuayu/cd3ad1ddf076b2892a6c3c1c68a9ca34 |
But how does this get into Jaeger? Do you then take the OTEL spans and submit them to the Jaeger Collector's OTLP endpoint? |
Yes, you are right, I sent it to the otel-collector by HTTP. |
Please check my sample link again; I provided more data, in exactly the same format as production (I send spans in batches), and you can reproduce with it.
My data should be no problem, I have confidence in that. If the data constraint breaks, maybe it was introduced in another jaeger component. |
This is what I am getting from your description:

flowchart LR
    A[App] --> |Custom\nJSON|Kafka
    Kafka --> C[Custom\nConverter]
    C --> |OTLP JSON|JC[Jaeger\nCollector]
    JC --> |??|Kafka2
    Kafka2 --> JI[Jaeger\nIngester]
Btw, when I run your curl command against jaeger-all-in-one with OTLP enabled, it fails:

$ curl --location --request POST 'http://127.0.0.1:4318/v1/traces' \
--header 'Content-Type: application/json' \
--data-raw ~/Downloads/otel-trace.json
{"code":3,"message":"invalid character '/' looking for beginning of value"} |
Hey @yurishkuro, thanks for the update on the PR and the review! We have the same setup as @huahuayu, and it only happened after we upgraded Jaeger, Elasticsearch, and Kafka. I believe it went wrong at the Jaeger Collector to Kafka step, or during the upgrade itself. If someone is looking for a workaround, it is purging the messages from the topic on Kafka:

kubectl exec -it -n <jaeger-namespace> jaeger-kafka-0 -- sh
cd /opt/bitnami/kafka/bin
./kafka-configs.sh --bootstrap-server jaeger-kafka:9092 --topic jaeger-spans --alter --add-config retention.ms=1000
./kafka-configs.sh --bootstrap-server jaeger-kafka:9092 --topic jaeger-spans --alter --delete-config retention.ms |
So all the fixes we have so far are defensive. We still don't know how messages exiting collector->Kafka or entering Kafka->ingester might end up with a nil Process. If it's due to data corruption in Kafka, it seems very peculiar that it's always the nil Process that people experience (although it's very possible that, because we never checked for nil there, it's just the most obvious symptom). I just merged the fix #3578 that should prevent the panics; maybe then someone can investigate what the spans look like when stored, i.e. maybe other parts of the span are corrupted too. |
@locmai did you notice any unusual logs in the collector during the Kafka upgrade? Other than the Kafka upgrade corrupting data already stored (possible, I guess), collector->Kafka is indeed a likely place, possibly due to the type of driver we're using. |
I don't think we've seen anything unusual from Kafka logs during the upgrade. Let me double-check on it tomorrow. We did the upgrade for 10 environments and 3/10 had the issue, so yes, very likely due to the Kafka upgrade. |
@yurishkuro the process flow is right. The data has no problem; your curl should send the plain JSON text, not the path to the JSON file, and then there will be no error. |
@locmai I didn't upgrade any components; I've been using jaeger 1.35 the whole time. |
This works fine with all-in-one:

$ curl --location --request POST 'http://127.0.0.1:4318/v1/traces' \
--header 'Content-Type: application/json' \
--data-binary @~/Downloads/otel-trace.json

@huahuayu are you saying that sending the same data payload through the collector->kafka->ingester pipeline reliably reproduces the panic for you? We need an easier way to set up this test; currently it's only automated via a GH action.
@huahuayu also, which encoding of Jaeger spans are you using in Kafka? |
The sample JSON I gave cannot reproduce the panic; what I want to say is that the original span data has no problem.
I don't know. It's not JSON, but it's not protobuf either; I think it's created automatically by jaeger. The topic name is jaeger-spans, and the msg looks like
|
Jaeger only supports JSON and Protobuf, so it has to be the latter. |
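For anyone checking their own setup: the encoding on both sides of Kafka is controlled by CLI flags, with protobuf as the default, which is consistent with the broker contents here not looking like JSON. A sketch of a matching pair of invocations; the broker address and ES URL are placeholders, and the exact flag set should be verified against your Jaeger version's --help output:

```shell
# Collector writes spans to Kafka (protobuf is the default encoding).
SPAN_STORAGE_TYPE=kafka jaeger-collector \
  --kafka.producer.brokers=broker:9092 \
  --kafka.producer.topic=jaeger-spans \
  --kafka.producer.encoding=protobuf

# Ingester reads them back with the matching encoding and writes to ES.
SPAN_STORAGE_TYPE=elasticsearch jaeger-ingester \
  --kafka.consumer.brokers=broker:9092 \
  --kafka.consumer.topic=jaeger-spans \
  --kafka.consumer.encoding=protobuf \
  --es.server-urls=http://elasticsearch:9200
```

If the producer and consumer encodings disagree, the ingester cannot decode the messages at all, which is a different failure mode from the nil-Process panic discussed here.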
I just pushed a docker-compose config for running the collector->Kafka->ingester pipeline (7006e9f) and tested it with the trace from https://gist.github.com/huahuayu/cd3ad1ddf076b2892a6c3c1c68a9ca34:
As I expected, the trace was stored without issues. So the issue is somewhere in the interaction between Jaeger and Kafka. Since the messages appear to be corrupted when stored in Kafka, my first suspicion is that it happens in the producer (jaeger collector) or the driver during Kafka broker maintenance. Unfortunately, I don't have much to go on without the ability to reproduce the issue. |
This could be fixed in v2, but since we cannot reproduce it, I'm closing the issue while also adding the tag.
What happened?
I start the jaeger containers with ES as the data storage, but the ingester container keeps going down with a nil pointer error.
Versions 1.35 and 1.36 have the same issue.
Steps to reproduce
Step 0: run jaeger with ES in containers.
Step 1: at first it works fine and I can see data in jaeger-ui, but then the ingester container keeps going down. Error log:
Expected behavior
Everything works.
Relevant log output
No response
Screenshot
No response
Additional context
No response
Jaeger backend version
No response
SDK
No response
Pipeline
No response
Storage backend
No response
Operating system
No response
Deployment model
No response
Deployment configs
No response