Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Globalnet: Ensure that iptable rule is actually deleted before trying ipset deletion #1746

Closed
sridhargaddam opened this issue Mar 22, 2022 · 15 comments · Fixed by #2978
Closed
Assignees
Labels
bug Something isn't working priority:medium

Comments

@sridhargaddam
Copy link
Member

sridhargaddam commented Mar 22, 2022

In a Globalnet deployment on OCP, while running e2e tests, the following errors are noticed.

I0322 10:00:37.535555       1 global_egressip_controller.go:138] Processing created GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition(nil), AllocatedIPs:[]string(nil)}
I0322 10:00:37.535594       1 global_egressip_controller.go:197] Allocating 1 global IP(s) for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:00:37.535622       1 ipset.go:349] Running ipset [create SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ hash:ip -exist]
I0322 10:00:37.536664       1 iface.go:249] Installing iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": -p all -m set --match-set SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.253
I0322 10:00:37.541254       1 global_egressip_controller.go:258] Allocated [242.0.255.253] global IP(s) for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:00:37.641854       1 global_egressip_controller.go:351] Started pod watcher for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:00:37.641883       1 base_controllers.go:202] Updated: &v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2022, time.March, 22, 10, 0, 37, 541247789, time.Local), Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.253"}}
I0322 10:00:46.408867       1 egress_pod_watcher.go:96] Pod "e2e-tests-dataplane-gn-conn-nd-qlkvg/custom7lc89" with IP 10.128.0.38 created/updated
I0322 10:00:46.408893       1 ipset.go:349] Running ipset [add SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ 10.128.0.38 -exist]
I0322 10:03:59.906507       1 egress_pod_watcher.go:110] Pod "e2e-tests-dataplane-gn-conn-nd-qlkvg/custom7lc89" removed
I0322 10:03:59.906533       1 ipset.go:349] Running ipset [del SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ 10.128.0.38]
I0322 10:04:00.203880       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2022, time.March, 22, 10, 0, 37, 0, time.Local), Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.253"}}
I0322 10:04:00.203931       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.253] for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:04:00.203952       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": -p all -m set --match-set SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.253
I0322 10:04:00.206445       1 ipset.go:349] Running ipset [destroy SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ]
E0322 10:04:00.210974       1 global_egressip_controller.go:298] Error destroying the ipSet "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ" for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": error destroying set "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ": exit status 1 (ipset v7.1: Set cannot be destroyed: it is in use by a kernel component
)
I0322 10:04:00.221252       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2022, time.March, 22, 10, 0, 37, 0, time.Local), Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.253"}}
I0322 10:04:00.221288       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.253] for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:04:00.221306       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": -p all -m set --match-set SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.253
I0322 10:04:00.223218       1 ipset.go:349] Running ipset [destroy SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ]

Some of the e2e tests create GlobalEgressIPs with PodSelectors. After the test is completed, when the e2e test tries to delete the GlobalEgressIPs, Globalnet controller initially deletes the associated iptable rules and then tries to delete the corresponding ipset. However, its seen that deleting the iptable rule is taking few milliseconds for the actual deletion to happen in the kernel and in the meantime when Globalnet controller tries to delete the ipset, it gets an error message as shown below.

I0322 10:04:00.203880       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2022, time.March, 22, 10, 0, 37, 0, time.Local), Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.253"}}
I0322 10:04:00.203931       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.253] for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:04:00.203952       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": -p all -m set --match-set SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.253
I0322 10:04:00.206445       1 ipset.go:349] Running ipset [destroy SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ]
E0322 10:04:00.210974       1 global_egressip_controller.go:298] Error destroying the ipSet "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ" for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": error destroying set "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ": exit status 1 (ipset v7.1: Set cannot be destroyed: it is in use by a kernel component
)

Globalnet controller has a retry mechanism and on the next retry it succeeds.

I0322 10:04:00.221252       1 global_egressip_controller.go:138] Processing deleted GlobalEgressIP "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg", NumberOfIPs: 1, PodSelector: (*v1.LabelSelector)(nil), Status: v1.GlobalEgressIPStatus{Conditions:[]v1.Condition{v1.Condition{Type:"Allocated", Status:"True", ObservedGeneration:0, LastTransitionTime:time.Date(2022, time.March, 22, 10, 0, 37, 0, time.Local), Reason:"Success", Message:"Allocated 1 global IP(s)"}}, AllocatedIPs:[]string{"242.0.255.253"}}
I0322 10:04:00.221288       1 base_controllers.go:161] Releasing previously allocated IPs [242.0.255.253] for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"
I0322 10:04:00.221306       1 iface.go:263] Deleting iptable egress rules for Namespace "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": -p all -m set --match-set SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ src -m mark --mark 0xC0000/0xC0000 -j SNAT --to 242.0.255.253
I0322 10:04:00.223218       1 ipset.go:349] Running ipset [destroy SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ]
I0322 10:04:00.226942       1 global_egressip_controller.go:305] Successfully deleted all the iptables/ipset rules for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg"

It would be good if we can figure out a way to ensure that iptable rule is actually deleted before trying to delete the ipset in the first iteration itself as otherwise the error logs could be misleading while debugging any real issue.

@sridhargaddam sridhargaddam added the bug Something isn't working label Mar 22, 2022
@stale
Copy link

stale bot commented Jul 30, 2022

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Jul 30, 2022
@tpantelis tpantelis removed the wontfix This will not be worked on label Jul 31, 2022
@stale
Copy link

stale bot commented Dec 21, 2022

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Dec 21, 2022
@stale stale bot closed this as completed Dec 31, 2022
@skitt skitt reopened this Jan 2, 2023
@stale stale bot removed the wontfix This will not be worked on label Jan 2, 2023
@cyclinder
Copy link
Contributor

Maybe I can work on this :)

@skitt
Copy link
Member

skitt commented Feb 28, 2023

Maybe I can work on this :)

Excellent, go for it!

@dfarrell07
Copy link
Member

Maybe-related: #2447

@stale
Copy link

stale bot commented Sep 17, 2023

This issue has been automatically marked as stale because it has not had activity for 60 days. It will be closed if no further activity occurs. Please make a comment if this issue/pr is still valid. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Sep 17, 2023
@dfarrell07
Copy link
Member

@cyclinder are you still planning on working on this?

@dfarrell07 dfarrell07 removed the wontfix This will not be worked on label Oct 24, 2023
@cyclinder
Copy link
Contributor

Hi @dfarrell07, I'm sorry for the delay, I thought this issue had been fixed, I can work on this next weekend, I'm busy with work these days...

@tpantelis tpantelis assigned tpantelis and unassigned cyclinder Feb 5, 2024
@cyclinder
Copy link
Contributor

cyclinder commented Feb 21, 2024

Hi @sridhargaddam @dfarrell07, Sorry for the delay, I'm working on this.

the iptable rules are actually deleting, so ipset can't destroy. I think we just need sleep or retry, refer to weaveworks/weave#3882, WDYT?

@sridhargaddam
Copy link
Member Author

Hi @sridhargaddam @dfarrell07, Sorry for the delay, I'm working on this.

the iptable rules are actually deleting, so ipset can't destroy. I think we just need sleep or retry, refer to weaveworks/weave#3882, WDYT?

We already have a retry mechanism in the code. Wrt adding sleep, I dont think its a good idea.
@yboaron are we still seeing these errors in an OCP deployment?

@cyclinder
Copy link
Contributor

@sridhargaddam where is the retry mechanism in the code? I haven't found it, Can you tell me more details? thanks.

@yboaron
Copy link
Contributor

yboaron commented Feb 21, 2024

We use resource syncer mechanism from Admiral project , one of the fields is Transform, Transform is pointer to Function that returns Boolean if event needs to be requeued

For GlobalNet egress we set resource syncer Trasnform function to controller.process

controller.process Calls to onDelete function

if Delete fails we return True to requeue

@cyclinder
Copy link
Contributor

Understood, thanks for the explanation @yboaron

we can eventually delete all these rules, If we don't want the logs could be misleading, a simple way is to change the logs, something like:

Error destroying the ipSet "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ" for "e2e-tests-dataplane-gn-conn-nd-qlkvg/test-e2e-egressip-e2e-tests-dataplane-gn-conn-nd-qlkvg": error destroying set "SM-GN-BTG66WLFVD3ZAKHSL7T5RIVDJ": exit status 1 (ipset v7.1: Set cannot be destroyed: it is in use by a kernel component), retrying...

@yboaron
Copy link
Contributor

yboaron commented Feb 21, 2024

are we still seeing these errors in an OCP deployment?

I didn't see it recently

tpantelis added a commit to tpantelis/submariner that referenced this issue Apr 15, 2024
This handles the case where a set is destroyed right after associated
IP table rule(s) are deleted where the latter may not be complete yet.
Retry up tp 2 seconds.

Fixes submariner-io#1746

Signed-off-by: Tom Pantelis <[email protected]>
tpantelis added a commit to tpantelis/submariner that referenced this issue Apr 15, 2024
This handles the case where a set is destroyed right after associated
IP table rule(s) are deleted where the latter may not be complete yet.
Retry up to 2 seconds.

Fixes submariner-io#1746

Signed-off-by: Tom Pantelis <[email protected]>
@tpantelis
Copy link
Contributor

I think it's reasonable to internally retry in ipset.DestroySet if a transient "is in use" error occurs to avoid intermittent errors in the log. A maximum of 2 seconds seems safely sufficient based on the timing in the logs above.

tpantelis added a commit to tpantelis/submariner that referenced this issue Apr 17, 2024
This handles the case where a set is destroyed right after associated
IP table rule(s) are deleted where the latter may not be complete yet.
Retry up to 2 seconds.

Fixes submariner-io#1746

Signed-off-by: Tom Pantelis <[email protected]>
tpantelis added a commit that referenced this issue Apr 18, 2024
This handles the case where a set is destroyed right after associated
IP table rule(s) are deleted where the latter may not be complete yet.
Retry up to 2 seconds.

Fixes #1746

Signed-off-by: Tom Pantelis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:medium
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

6 participants