Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to set/get socket timeout option for default handle #614

Conversation

ArthurChiao
Copy link
Contributor

Ref: #613

Signed-off-by: arthurchiao [email protected]

@brb
Copy link
Contributor

brb commented Feb 1, 2021

Hey @aboch , could you take a look?

handle_linux.go Outdated
// Empty handle used by the netlink package methods
var pkgHandle = &Handle{}
// Default handle used by the netlink package methods
var pkgHandle *Handle = nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to make this a pointer?
Can't we still set the socket TO in init()?

Copy link
Contributor Author

@ArthurChiao ArthurChiao Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aboch thanks for reviewing!

Not sure if I understand your comment correctly: if pkgHandle was initialized as &Handle{} as previously, the pkgHandle.sockets would be an empty map, and we can not set TO values for these sockets.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ArturChiao now I remember.

With your proposed change, any binary depending on vishvananda/netlink will open 3 nl sockets, even when no netlink request is actually being executed (https://github.com/vishvananda/netlink/blob/v1.1.0/nl/nl_linux.go#L34).

Current behavior of this library has been: Open the nl socket only when you execute a pkg method (AddAddr(), AddLink(),...).

The addition of the Handle thing was done in a way so that above behavior was not modified.

I know it is limiting, but have you considered setting a default TO in func (req *NetlinkRequest) Execute() at https://github.com/vishvananda/netlink/blob/v1.1.0/nl/nl_linux.go#L406 instead ?

At the end of the day, a minute timeout (the default you are proposing) should satisfy most if not all use-cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanations! @aboch

Set TO for each socket we're going to use makes sense to me.
But I'm not sure if you would like to make the TO value configurable or not (e.g. hardcode 1 minute).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored to set TO after we got a socket, please have a review, thanks! @aboch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL @aboch

@Ewocker
Copy link

Ewocker commented Feb 2, 2021

I was trying to see what was the problem for the patch.
For test case TestAddrAddReplace the output of AddList originally is [127.0.0.2/24 lo] whereas with the patch there is an extra entry [127.0.0.1/8 lo 127.0.0.2/24 lo].

@ArthurChiao ArthurChiao force-pushed the allow_configure_socket_timeout_for_default_handle branch from e5b8362 to 4cc0080 Compare February 2, 2021 06:50
@ArthurChiao
Copy link
Contributor Author

Hi @Ewocker, thanks for your reminder, I'll take a look at it!

@ArthurChiao ArthurChiao force-pushed the allow_configure_socket_timeout_for_default_handle branch from 4cc0080 to 6a11e6b Compare February 2, 2021 07:27
@Ewocker
Copy link

Ewocker commented Feb 2, 2021

Hi @Ewocker, thanks for your reminder, I'll take a look at it!

@ArthurChiao, asking out of curiosity, why was test TestAddrSubscribeWithOptions in original PR hitting the 1m TO? Also, why would the AddrAdd call create 127.0.0.1/8 in test TestAddrAddReplace ?

don't know too much about netlink, would really appreciate if you can answer my questions 😁

@ArthurChiao
Copy link
Contributor Author

@Ewocker actually I can't reproduce it in my develop env :(

And I found that even without my changes, many tests would fail in my dev node (already with root privilege), but with different reasons, such as the following one:

$ go test -v -run TestAddrAdd
=== RUN   TestAddrAdd
    TestAddrAdd: addr_test.go:90: Address flags not set properly, got=128, expected=140
--- FAIL: TestAddrAdd (0.00s)
=== RUN   TestAddrAddReplace
--- PASS: TestAddrAddReplace (0.00s)
FAIL
exit status 1
FAIL    github.com/vishvananda/netlink  0.004s

and this one:

--- FAIL: TestSEG6RouteAddDel (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x64863e]

goroutine 230 [running]:
testing.tRunner.func1.1(0x693980, 0x8ece40)
        /usr/local/go/src/testing/testing.go:940 +0x2f5
testing.tRunner.func1(0xc000305440)
        /usr/local/go/src/testing/testing.go:943 +0x3f9
panic(0x693980, 0x8ece40)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/vishvananda/netlink.TestSEG6RouteAddDel(0xc000305440)
        /home/arthurchiao/go/src/github.com/vishvananda/netlink/route_test.go:1176 +0xa9e

Haven't dug into these problems.

Copy link
Contributor

@brb brb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

@nr17
Copy link

nr17 commented Feb 2, 2021

@Ewocker actually I can't reproduce it in my develop env :(

And I found that even without my changes, many tests would fail in my dev node (already with root privilege), but with different reasons, such as the following one:

$ go test -v -run TestAddrAdd
=== RUN   TestAddrAdd
    TestAddrAdd: addr_test.go:90: Address flags not set properly, got=128, expected=140
--- FAIL: TestAddrAdd (0.00s)
=== RUN   TestAddrAddReplace
--- PASS: TestAddrAddReplace (0.00s)
FAIL
exit status 1
FAIL    github.com/vishvananda/netlink  0.004s

and this one:

--- FAIL: TestSEG6RouteAddDel (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
        panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x64863e]

goroutine 230 [running]:
testing.tRunner.func1.1(0x693980, 0x8ece40)
        /usr/local/go/src/testing/testing.go:940 +0x2f5
testing.tRunner.func1(0xc000305440)
        /usr/local/go/src/testing/testing.go:943 +0x3f9
panic(0x693980, 0x8ece40)
        /usr/local/go/src/runtime/panic.go:969 +0x166
github.com/vishvananda/netlink.TestSEG6RouteAddDel(0xc000305440)
        /home/arthurchiao/go/src/github.com/vishvananda/netlink/route_test.go:1176 +0xa9e

Haven't dug into these problems.

The Makefile (line#22) runs 4 tests in parallel. Tests setup and cleanup the new ns using setUpNetlinkTest.(netlink/netlink_test.go). Although the ns is supposed to be thread-local by making use of runtime.LockOSThread(), if the test goroutines running in parallel share the same underlying OS thread, they will interfere with each other.

@aboch aboch merged commit 9de6d08 into vishvananda:master Feb 18, 2021
@ArthurChiao ArthurChiao deleted the allow_configure_socket_timeout_for_default_handle branch February 18, 2021 04:33
brb added a commit to cilium/cilium that referenced this pull request Mar 5, 2021
The bumped version includes the change [1] which sets a timeout for a
default netlink socket. This prevents deadlocks when receiving a netlink
msg (netlink is not reliable proto).

[1]: vishvananda/netlink#614

Signed-off-by: Martynas Pumputis <[email protected]>
christarazi pushed a commit to cilium/cilium that referenced this pull request Mar 5, 2021
The bumped version includes the change [1] which sets a timeout for a
default netlink socket. This prevents deadlocks when receiving a netlink
msg (netlink is not reliable proto).

[1]: vishvananda/netlink#614

Signed-off-by: Martynas Pumputis <[email protected]>
aanm pushed a commit to cilium/cilium that referenced this pull request Mar 8, 2021
The bumped version includes the change [1] which sets a timeout for a
default netlink socket. This prevents deadlocks when receiving a netlink
msg (netlink is not reliable proto).

[1]: vishvananda/netlink#614

Signed-off-by: Martynas Pumputis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants