[k8s] The k8s integration tests are failing #33520

TylerHelmuth · 2024-06-12T20:17:51Z

Component(s)

processor/k8sattributes, receiver/k8scluster, receiver/k8sobjects, receiver/kubeletstats

Describe the issue you're reporting

The k8s integration tests have started failing. See https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/workflows/e2e-tests.yml?query=branch%3Amain.

TylerHelmuth · 2024-06-12T20:18:14Z

I have been unable to reproduce the issues locally and reverting #33415 did not help (according to the CI jobs on main that was the first commit where things started to flake).

Looking at the workflow it looks like all versions are pinned so I don't think we suddenly started using some new action, kind versions, etc.

github-actions · 2024-06-12T20:19:03Z

Pinging code owners for internal/k8stest: @crobert-1. See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth · 2024-06-12T20:22:34Z

@jinja2 @fatsheep9146 any guesses?

github-actions · 2024-06-12T20:23:05Z

Pinging code owners for receiver/k8sobjects: @dmitryax @hvaghani221 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-06-12T20:23:07Z

Pinging code owners for processor/k8sattributes: @dmitryax @rmfitzpatrick @fatsheep9146 @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-06-12T20:23:08Z

Pinging code owners for receiver/k8scluster: @dmitryax @TylerHelmuth @povilasv. See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-06-12T20:23:09Z

Pinging code owners for receiver/kubeletstats: @dmitryax @TylerHelmuth. See Adding Labels via Comments if you do not have permissions to add labels yourself.


axw · 2024-06-13T08:18:56Z

I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(

One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.
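A rough idea of what capturing pod logs before teardown could look like, as a minimal Go sketch. It assumes `kubectl` is on the PATH and already points at the kind cluster; `podLogsArgs` and `dumpPodLogs` are hypothetical helper names, and the pod names in `main` are placeholders rather than the names the e2e tests actually use.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// podLogsArgs builds the kubectl arguments for dumping the logs of every
// container in one pod. Namespace and pod name are supplied by the caller.
func podLogsArgs(namespace, pod string) []string {
	return []string{
		"logs", pod,
		"--namespace", namespace,
		"--all-containers=true",
		"--prefix=true",
	}
}

// dumpPodLogs runs kubectl and streams its output to the CI job log.
// Assumption: kubectl is installed and configured for the test cluster.
func dumpPodLogs(namespace, pod string) error {
	cmd := exec.Command("kubectl", podLogsArgs(namespace, pod)...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Placeholder pod names; a real workflow would list the pods first.
	// Only the command line is printed here, so this runs without a cluster.
	for _, pod := range []string{"otelcol", "telemetrygen"} {
		fmt.Println("kubectl " + strings.Join(podLogsArgs("default", pod), " "))
	}
}
```

Calling `dumpPodLogs` from the test cleanup (or an equivalent step in the workflow) before the cluster is deleted would keep the collector and telemetrygen output in the job log.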

fatsheep9146 · 2024-06-13T08:33:23Z

> I had reproduced the error locally yesterday (or at least something that looked the same), but had to switch focus before I could find the root cause. Now I can't reproduce it :(
>
> One thing I did notice was in the collector logs there were errors about not being able to connect to kind-control-plane. Perhaps the e2e workflow should capture the pod logs before tearing down, to make debugging easier.

I also could not reproduce the error seen in the GitHub Actions run, which is really weird. But your advice to capture the pod logs (whether collector or telemetrygen) in the workflow to help debugging is really good. @axw

ChrsMark · 2024-06-13T09:31:42Z

Not sure if there is another way to get access to the Pods' logs, but I tried something dirty to capture them: #33538.
Let's see if this can provide us some insights here.

ChrsMark · 2024-06-13T10:03:04Z

Got some interesting "connection refused" errors: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9497224255/job/26173693278?pr=33538#step:11:225

2024-06-13T09:44:56.953Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.546970563s"}
2024-06-13T09:44:57.064Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "logs", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "7.612004411s"}
2024-06-13T09:44:57.486Z	info	exporterhelper/retry_sender.go:118	Exporting failed. Will retry the request after interval.	{"kind": "exporter", "data_type": "traces", "name": "otlp", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "6.403460654s"}

fatsheep9146 · 2024-06-13T11:07:40Z

@ChrsMark
It seems the logic for getting hostEndpoint is the root cause, and this logic differs between macOS and Linux.

fatsheep9146 · 2024-06-13T11:12:29Z

@ChrsMark In your latest PR, the hostEndpoint is empty. I think this is the root cause.
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9498361987/job/26177138997?pr=33538
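
To make the link between an empty hostEndpoint and the `127.0.0.1:4317` / `[::1]:4317` dials in the logs concrete, here is a minimal Go sketch. It assumes the exporter endpoint is built by joining the host endpoint with port 4317 (the port seen in the logs); `exporterEndpoint` and `dialHost` are hypothetical names, not the test helper's real code. Go's `net.Dial` documents that an empty host, as in `:80`, means the local system, which is consistent with the collector ending up dialing localhost.

```go
package main

import (
	"fmt"
	"net"
)

// exporterEndpoint is a hypothetical helper: it joins the host endpoint
// with the OTLP gRPC port that appears in the logs.
func exporterEndpoint(hostEndpoint string) string {
	return net.JoinHostPort(hostEndpoint, "4317")
}

// dialHost mimics the documented net.Dial convention that an empty host
// means the local system. It is an illustration, not gRPC's resolver code.
func dialHost(endpoint string) (string, error) {
	host, _, err := net.SplitHostPort(endpoint)
	if err != nil {
		return "", err
	}
	if host == "" {
		return "localhost", nil
	}
	return host, nil
}

func main() {
	for _, h := range []string{"172.18.0.1", ""} {
		ep := exporterEndpoint(h)
		host, err := dialHost(ep)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("hostEndpoint=%q endpoint=%q dials %s\n", h, ep, host)
	}
}
```

With a gateway address the endpoint targets the host; with an empty string it collapses to `:4317`, so the telemetry never leaves the pod.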

fatsheep9146 · 2024-06-13T11:47:21Z

I suspect the cause is https://github.com/actions/runner-images/pull/10039/files.
The ubuntu-latest image we use in GitHub Actions was updated with a new version of Docker.

ChrsMark · 2024-06-13T12:38:28Z

Sounds possible @fatsheep9146, I will try to upgrade docker on my machine to 26.x.x as well and see if I can reproduce it.

Update:

I was able to reproduce this locally with docker 26.1.4 (ubuntu machine).
Collector Pod logs:

2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.052Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #2 SubChannel #8]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:55.316Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #1 SubChannel #9]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "127.0.0.1:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"	{"grpc_log": true}
2024-06-13T12:42:57.265Z	warn	zapgrpc/zapgrpc.go:193	[core] [Channel #4 SubChannel #11]grpc: addrConn.createTransport failed to connect to {Addr: "[::1]:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp [::1]:4317: connect: connection refused"	{"grpc_log": true}

fatsheep9146 · 2024-06-13T13:02:10Z

@ChrsMark I'm trying to update the Docker SDK version to see if that fixes the problem.

ChrsMark · 2024-06-13T13:11:20Z

@fatsheep9146 thanks! FYI, while debugging this I spotted that the following call (opentelemetry-collector-contrib/internal/k8stest/k8s_data_helpers.go, line 28 in 6b1d3dd)

    network, err := client.NetworkInspect(ctx, "kind", types.NetworkInspectOptions{})

is failing with context deadline exceeded, but the weird thing is that this error is for some reason "muted".

Hopefully the lib upgrade can solve this.
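
For anyone wondering how a `context deadline exceeded` can end up "muted", here is a minimal sketch, assuming the helper discards the error and carries on. `networkInspector`, `slowDocker`, `hostEndpointMuted`, `hostEndpointStrict`, and `demo` are hypothetical names for illustration only; the real helper calls the Docker client's `NetworkInspect` as quoted above, and its actual error handling may differ.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// networkInspector is a hypothetical stand-in for the Docker client call.
type networkInspector interface {
	GatewayOf(ctx context.Context, network string) (string, error)
}

// slowDocker simulates a daemon that never answers before the deadline.
type slowDocker struct{}

func (slowDocker) GatewayOf(ctx context.Context, _ string) (string, error) {
	<-ctx.Done()
	return "", ctx.Err()
}

// hostEndpointMuted drops the error, so callers only ever see "".
func hostEndpointMuted(ctx context.Context, c networkInspector) string {
	gw, _ := c.GatewayOf(ctx, "kind")
	return gw
}

// hostEndpointStrict wraps and returns the error so a test can fail fast.
func hostEndpointStrict(ctx context.Context, c networkInspector) (string, error) {
	gw, err := c.GatewayOf(ctx, "kind")
	if err != nil {
		return "", fmt.Errorf("inspecting kind network: %w", err)
	}
	return gw, nil
}

// demo runs both variants against the slow daemon with a short timeout.
func demo() (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	muted := hostEndpointMuted(ctx, slowDocker{})
	_, err := hostEndpointStrict(ctx, slowDocker{})
	return muted, err
}

func main() {
	muted, err := demo()
	fmt.Printf("muted variant returned %q\n", muted)
	fmt.Println("strict variant returned:", err)
}
```

The muted variant hands an empty endpoint to the rest of the test, which only fails much later with connection errors; the strict variant surfaces the timeout at the point where it happens.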

ChrsMark · 2024-06-13T14:20:49Z

I had a successful run at #33548. I'm going to enable the rest of the tests and check again.

fatsheep9146 · 2024-06-13T14:41:00Z

> I had a successful run at #33548. I'm going to enable the rest of the tests and check again.

@ChrsMark

Yes, I found that updating the Docker SDK library is blocked for some reasons:
#32614
#31989

So I am also trying another way to get the right host endpoint:
https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501492668/job/26187569925?pr=33542

I think we can try both ways and get more opinions from others.

ChrsMark · 2024-06-13T15:06:00Z

I hit an additional error in the k8scluster receiver. It seems that some image names have changed as well: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35

potential fix: c87a639

fatsheep9146 · 2024-06-13T15:08:05Z

> I hit an additional error in the k8scluster receiver. It seems that some image names have changed as well: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/9501435003/job/26187307997?pr=33548#step:11:35
>
> potential fix: c87a639

I think this may be due to the newer version of kind.

ChrsMark · 2024-06-13T15:45:52Z

@fatsheep9146 e2e tests passed at #33548. I'm opening that one for review since it offers a fix anyway. I'll be out tomorrow (Friday), so feel free to pick up the gateway check and proceed with yours if people find that approach more suitable. I'm fine either way as long as we solve the issue :).

@fatsheep9146

**Description:** <Describe what has changed.>
Only return address that is not empty for `kind` network. This started affecting the e2e tests possibly because of the `ubuntu-latest`'s docker version update that is mentioned at #33520 (comment).

Relates to #33520.

/cc @fatsheep9146

Sample `kind` network:
```console
curl --unix-socket /run/docker.sock http://docker/networks/kind | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   841  100   841    0     0   821k      0 --:--:-- --:--:-- --:--:--  821k
{
  "Name": "kind",
  "Id": "801d2abe204253cbd5d1d135f111a7fb386b830382bde79a699fb4f9aaf674b1",
  "Created": "2024-06-13T15:31:57.738509232+03:00",
  "Scope": "local",
  "Driver": "bridge",
  "EnableIPv6": true,
  "IPAM": {
    "Driver": "default",
    "Options": {},
    "Config": [
      {
        "Subnet": "fc00:f853:ccd:e793::/64"
      },
      {
        "Subnet": "172.18.0.0/16",
        "Gateway": "172.18.0.1"
      }
    ]
  },
  "Internal": false,
  "Attachable": false,
  "Ingress": false,
  "ConfigFrom": {
    "Network": ""
  },
  "ConfigOnly": false,
  "Containers": {
    "db113750635782bc1bfdf31e5f62af3c63f02a9c8844f7fe9ef045b5d9b76d12": {
      "Name": "kind-control-plane",
      "EndpointID": "8b15bb391109ca1ecfbb4bf7a96060b01e3913694d34e23d67eec22684f037bb",
      "MacAddress": "02:42:ac:12:00:02",
      "IPv4Address": "172.18.0.2/16",
      "IPv6Address": "fc00:f853:ccd:e793::2/64"
    }
  },
  "Options": {
    "com.docker.network.bridge.enable_ip_masquerade": "true",
    "com.docker.network.driver.mtu": "1500"
  },
  "Labels": {}
}
```

**Link to tracking Issue:** <Issue number if applicable>

**Testing:** <Describe what testing was performed and which tests were added.>

**Documentation:** <Describe the documentation added.>

---------

Signed-off-by: ChrsMark <[email protected]>
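
The idea in the description ("only return address that is not empty") can be sketched against the sample `kind` network, where the IPv6 IPAM entry has no gateway and the IPv4 entry does. This is a minimal, standard-library-only illustration: the `ipamConfig` and `network` types mirror only the JSON fields shown in the sample, and `firstNonEmptyGateway` is a hypothetical name, not the function changed in #33548.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ipamConfig mirrors one entry of IPAM.Config in the sample output.
type ipamConfig struct {
	Subnet  string `json:"Subnet"`
	Gateway string `json:"Gateway"`
}

// network mirrors only the fields of the sample that are needed here.
type network struct {
	Name string `json:"Name"`
	IPAM struct {
		Config []ipamConfig `json:"Config"`
	} `json:"IPAM"`
}

// sampleKindNetwork is a trimmed copy of the sample `kind` network.
const sampleKindNetwork = `{
  "Name": "kind",
  "IPAM": {
    "Config": [
      {"Subnet": "fc00:f853:ccd:e793::/64"},
      {"Subnet": "172.18.0.0/16", "Gateway": "172.18.0.1"}
    ]
  }
}`

// firstNonEmptyGateway skips IPAM entries without a gateway instead of
// returning the first entry's (possibly empty) value.
func firstNonEmptyGateway(raw []byte) (string, error) {
	var n network
	if err := json.Unmarshal(raw, &n); err != nil {
		return "", fmt.Errorf("decoding network: %w", err)
	}
	for _, c := range n.IPAM.Config {
		if c.Gateway != "" {
			return c.Gateway, nil
		}
	}
	return "", errors.New("no gateway found for network " + n.Name)
}

func main() {
	gw, err := firstNonEmptyGateway([]byte(sampleKindNetwork))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(gw)
}
```

Returning an error when no entry has a gateway, rather than an empty string, also avoids the silent fallback to localhost seen in the collector logs.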

crobert-1 · 2024-06-13T20:49:58Z

Resolved by #33548

crobert-1 · 2024-06-13T21:28:37Z

Thanks for addressing and fixing so quickly @ChrsMark and @fatsheep9146!

TylerHelmuth added the help wanted, ci-cd, flaky test, and internal/k8stest labels Jun 12, 2024

TylerHelmuth added the processor/k8sattributes, receiver/k8scluster, receiver/k8sobjects, and receiver/kubeletstats labels Jun 12, 2024

TylerHelmuth mentioned this issue Jun 12, 2024

[k8s] Skip k8s e2e #33521

Merged

TylerHelmuth added the release:blocker label (the issue must be resolved before cutting the next release) Jun 12, 2024

codeboten mentioned this issue Jun 12, 2024

Updates to OTel-Arrow v0.24.0 deps #33518

Merged

jmacd mentioned this issue Jun 12, 2024

[probabilistic sampling processor] encoded sampling probability (support OTEP 235) #31894

Merged

TylerHelmuth added a commit that referenced this issue Jun 12, 2024

[k8s] Skip k8s e2e (#33521)

d3873bb

Related to #33520

rinx mentioned this issue Jun 13, 2024

[extension/googleclientauth] Update github.com/GoogleCloudPlatform/opentelemetry-operations-go v0.48.0 #33493

Merged

ChrsMark mentioned this issue Jun 13, 2024

Try to get logs of telemetrygen #33538

Closed

ChrsMark mentioned this issue Jun 13, 2024

[chore] [k8s] fix k8s e2e tests #33548

Merged

fatsheep9146 mentioned this issue Jun 13, 2024

fix e2e failed test #33542

Closed

crobert-1 closed this as completed Jun 13, 2024
