Component(s)
connector/servicegraph
What happened?
Description
The `failed` label fails to distinguish successful and failed edges in the servicegraph metrics. I found some odd graph metrics: two series have the same label set except for `failed`, yet their values stayed very close over six hours, which should be impossible. Besides, `traces_service_graph_request_total` carries a `failed=true` label, which also looks like a bug.

After reading the component code, I found that the `failed` dimension is not part of the `metricKey`:

opentelemetry-collector-contrib/connector/servicegraphconnector/connector.go, lines 605 to 618 in 0a12ede

This leads the component to pick up the wrong label set when it collects metrics:

opentelemetry-collector-contrib/connector/servicegraphconnector/connector.go, line 505 in 0a12ede
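To make the key collision concrete, here is a minimal sketch. The `edge` type and both key functions are hypothetical simplifications written for this issue, not the connector's actual code, and `keyWithFailed` is only one possible way to fix it, not necessarily the right patch.

```go
package main

import "fmt"

// edge is a simplified, hypothetical stand-in for the connector's edge type.
type edge struct {
	ClientService  string
	ServerService  string
	ConnectionType string
	Failed         bool
}

// keyWithoutFailed mirrors the behaviour described in this issue:
// the failed flag does not take part in the key.
func keyWithoutFailed(e edge) string {
	return e.ClientService + e.ServerService + e.ConnectionType
}

// keyWithFailed is one possible fix (an assumption, not the upstream patch):
// include the failed flag so each outcome gets its own key.
func keyWithFailed(e edge) string {
	return fmt.Sprintf("%s|%s|%s|%t", e.ClientService, e.ServerService, e.ConnectionType, e.Failed)
}

func main() {
	ok := edge{ClientService: "foo", ServerService: "bar"}
	failed := edge{ClientService: "foo", ServerService: "bar", Failed: true}

	// Same key for both outcomes: the dimensions of one overwrite the other.
	fmt.Println(keyWithoutFailed(ok) == keyWithoutFailed(failed))
	// Distinct keys once the failed flag is part of the key.
	fmt.Println(keyWithFailed(ok) == keyWithFailed(failed))
}
```

With the first function, a successful and a failed call between the same two services share one key; with the second, they are kept apart.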
I wrote a unit test to demonstrate it: test failed label not work
In the unit test, I simulate trace data for two services, foo and bar. foo calls bar three times; two calls succeed and one fails.
I expect these sample traces to generate graph metrics:
However, the component produces erroneous metrics:
error metrics content
In detail, the key problem is that the `metricKey` omits the `failed` label, so in some cases the same key refers to different dimension values. I can demonstrate it:

First, assume this is the first span to go through the connector, and an edge finishes with these values (without error):
`e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=false`
Its `metricKey` will be `foobar`, and the key refers to its dimensions (stored in the `keyToMetric` map): `{client:foo, server:bar, connection_type: , failed: false}`.
At this point, `reqTotal` is `{"foobar": 1}`. After collecting, the resulting metric has `failed: false` and a value of 1.

Then the second edge finishes with these values (with an error):
`e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=true`
This edge also generates the key `foobar`, and its dimensions are `{client:foo, server:bar, connection_type: , failed: true}`. After this step, the value for `foobar` in `keyToMetric` is overwritten, and the bug occurs. Now `reqTotal` is `{"foobar": 2}` and `reqFailedTotal` is `{"foobar": 1}`. After collecting, the metrics pick up the overwritten dimensions (`failed: true`), and that is what you will see in the metrics backend.
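The two steps above can be replayed in a small, self-contained Go sketch. The `dims` and `store` types and the `record` method are hypothetical simplifications; only the map names `keyToMetric`, `reqTotal` and `reqFailedTotal` are taken from the description above, and the real connector code is more involved.

```go
package main

import "fmt"

// dims is a hypothetical, simplified view of the stored dimensions.
type dims struct {
	Client, Server, ConnectionType string
	Failed                         bool
}

// store imitates the three maps named in the walkthrough.
type store struct {
	keyToMetric    map[string]dims
	reqTotal       map[string]int
	reqFailedTotal map[string]int
}

func newStore() *store {
	return &store{
		keyToMetric:    map[string]dims{},
		reqTotal:       map[string]int{},
		reqFailedTotal: map[string]int{},
	}
}

// record aggregates one finished edge under a key that ignores Failed.
func (s *store) record(d dims) {
	key := d.Client + d.Server + d.ConnectionType
	s.keyToMetric[key] = d // a later edge overwrites the earlier dimensions
	s.reqTotal[key]++
	if d.Failed {
		s.reqFailedTotal[key]++
	}
}

func main() {
	s := newStore()
	s.record(dims{Client: "foo", Server: "bar", Failed: false}) // first edge
	s.record(dims{Client: "foo", Server: "bar", Failed: true})  // second edge

	// Collection looks up the dimensions by key, so both counters
	// are reported with whatever dimensions were stored last.
	for key, n := range s.reqTotal {
		fmt.Printf("request_total %+v %d\n", s.keyToMetric[key], n)
	}
	for key, n := range s.reqFailedTotal {
		fmt.Printf("request_failed_total %+v %d\n", s.keyToMetric[key], n)
	}
}
```

In this sketch the total counter ends up labelled with `Failed:true` even though one of the two counted calls succeeded, which matches the symptom reported above.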
Collector version
main
Environment information
Environment
OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response