
[connector/servicegraph] failed label does not work leads to servicegraph metrics error #32018

Closed
Frapschen opened this issue Mar 28, 2024 · 3 comments · Fixed by #32019
Labels
bug Something isn't working connector/servicegraph

Comments

@Frapschen
Contributor

Component(s)

connector/servicegraph

What happened?

Description

The failed label fails to distinguish successful edges from failed edges in the servicegraph metrics.

I noticed these odd graph metrics (screenshot of the two series omitted):

The two series have the same label set except for failed, yet their values are very close to each other (over six hours), which should be impossible.

Besides, traces_service_graph_request_total carries a failed=true label, which also looks like a bug.

After reading the component code, I found that the failed dimension is never included in the metricKey:

func (p *serviceGraphConnector) buildMetricKey(clientName, serverName, connectionType string, edgeDimensions map[string]string) string {
	var metricKey strings.Builder
	metricKey.WriteString(clientName + metricKeySeparator + serverName + metricKeySeparator + connectionType)
	for _, dimName := range p.config.Dimensions {
		dim, ok := edgeDimensions[dimName]
		if !ok {
			continue
		}
		metricKey.WriteString(metricKeySeparator + dim)
	}
	return metricKey.String()
}
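
For illustration, here is a minimal, self-contained sketch of a key builder that also encodes the edge's Failed flag, so successful and failed edges map to distinct series. The helper name, separator value, and parameters are assumptions for this sketch; it is not necessarily how PR #32019 fixes the problem:

package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Stand-in for the separator constant defined in the connector; the actual value may differ.
const metricKeySeparator = string(byte(0))

// buildMetricKeyWithFailed is a hypothetical variant of buildMetricKey that appends
// the edge's Failed flag to the key, so a failed foo->bar edge and a successful
// foo->bar edge no longer collide on the same key.
func buildMetricKeyWithFailed(clientName, serverName, connectionType string, failed bool, dimensionNames []string, edgeDimensions map[string]string) string {
	var metricKey strings.Builder
	metricKey.WriteString(clientName + metricKeySeparator + serverName + metricKeySeparator + connectionType)
	// Encode the failed flag as part of the series identity.
	metricKey.WriteString(metricKeySeparator + strconv.FormatBool(failed))
	for _, dimName := range dimensionNames {
		if dim, ok := edgeDimensions[dimName]; ok {
			metricKey.WriteString(metricKeySeparator + dim)
		}
	}
	return metricKey.String()
}

func main() {
	okKey := buildMetricKeyWithFailed("foo", "bar", "", false, nil, nil)
	failedKey := buildMetricKeyWithFailed("foo", "bar", "", true, nil, nil)
	fmt.Println(okKey != failedKey) // true: the two edges now produce distinct keys
}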

Because the failed flag is missing from the key, the component looks up the wrong label set when it collects metrics:

dimensions, ok := p.dimensionsForSeries(key)

I demonstrate this with a unit test: test failed label not work

In the unit test, I simulate trace data from two services, foo and bar: foo calls bar three times, twice successfully and once with a failure.
I expect those traces to generate these graph metrics:

traces_service_graph_request_total{client="foo", server="bar", connection_type="", failed="false"} 2
traces_service_graph_request_total{client="foo", server="bar", connection_type="", failed="true"} 1
...

However, the component produces these incorrect metrics:

resourceMetrics:
  - resource: {}
    scopeMetrics:
      - metrics:
          - name: traces_service_graph_request_total
            sum:
              aggregationTemporality: 2
              dataPoints:
                - asInt: "3"
                  attributes:
                    - key: client
                      value:
                        stringValue: foo
                    - key: connection_type
                      value:
                        stringValue: ""
                    - key: failed
                      value:
                        boolValue: false
                    - key: server
                      value:
                        stringValue: bar
                  startTimeUnixNano: "1000000"
                  timeUnixNano: "2000000"
              isMonotonic: true
          - name: traces_service_graph_request_failed_total
            sum:
              aggregationTemporality: 2
              dataPoints:
                - asInt: "1"
                  attributes:
                    - key: client
                      value:
                        stringValue: foo
                    - key: connection_type
                      value:
                        stringValue: ""
                    - key: failed
                      value:
                        boolValue: false
                    - key: server
                      value:
                        stringValue: bar
                  startTimeUnixNano: "1000000"
                  timeUnixNano: "2000000"
              isMonotonic: true

In detail, the key problem is that the metricKey omits the failed label, so the same key can end up referring to different label sets.

I can demonstrate it:

First, assume this is the first span to go through the connector, and an edge finishes with these values (no error): e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=false.

Its metricKey will be foobar, and that key is mapped to its dimensions (stored in a keyToMetric map): {client: foo, server: bar, connection_type: , failed: false}

At this point, reqTotal is {"foobar": 1}, and after collecting metrics the result is:

traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: false} 1
...

Then a second edge finishes with these values (with an error): e.ClientService=foo, e.ServerService=bar, e.ConnectionType=, e.Failed=true.

This edge also generates the key foobar, but its dimensions are {client: foo, server: bar, connection_type: , failed: true}. After this step, the entry for foobar in keyToMetric is overwritten, and the bug occurs. Now reqTotal is {"foobar": 2} and reqFailedTotal is {"foobar": 1}, so after collecting, the metrics are:

traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: true} 2
traces_service_graph_request_failed_total{client:foo, server:bar, connection_type: , failed: true} 1
...

In the metrics backend, you will therefore see:

traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: false} 1
traces_service_graph_request_total{client:foo, server:bar, connection_type: , failed: true} 2
traces_service_graph_request_failed_total{client:foo, server:bar, connection_type: , failed: true} 1
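
The overwrite can be reproduced in isolation with a small simulation (the map names below are illustrative stand-ins, not the connector's actual fields):

package main

import "fmt"

func main() {
	// Illustrative stand-ins for the connector's per-key state.
	keyToDimensions := map[string]map[string]string{}
	reqTotal := map[string]int{}
	reqFailedTotal := map[string]int{}

	// First edge: foo -> bar, succeeded. The failed flag is not part of the key.
	key := "foobar"
	keyToDimensions[key] = map[string]string{"client": "foo", "server": "bar", "failed": "false"}
	reqTotal[key]++

	// Second edge: foo -> bar, failed. Same key, so the stored dimensions are overwritten.
	keyToDimensions[key] = map[string]string{"client": "foo", "server": "bar", "failed": "true"}
	reqTotal[key]++
	reqFailedTotal[key]++

	fmt.Println(keyToDimensions[key], reqTotal[key], reqFailedTotal[key])
	// Output: map[client:foo failed:true server:bar] 2 1
	// Both counters now report under failed="true"; the failed="false" series is lost.
}

Once the failed flag is part of the key (as in the earlier sketch), the two edges keep separate entries and the counts stay attributed to the correct series.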

Collector version

main

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@Frapschen added the bug and needs triage labels on Mar 28, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

cc @t00mas


github-actions bot commented Jul 1, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.
