Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/crypto/ssh: Dial hangs in kexLoop indefinitely - ignoring ClientConfig.Timeout #51926

Open
pjbgf opened this issue Mar 24, 2022 · 4 comments
Open
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@pjbgf
Copy link

pjbgf commented Mar 24, 2022

What version of Go are you using (go version)?

$ go version
go version go1.17.8 linux/amd64

Does this issue reproduce with the latest release?

Yes, as this is library related using version:

golang.org/x/crypto v0.0.0-20220321153916-2c7772ba3064

I can confirm the issue also happens with previous versions:

golang.org/x/[email protected]

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/USER/.cache/go-build"
GOENV="/home/USER/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/USER/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/USER/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.17.8"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build762703969=/tmp/go-build -gno-record-gcc-switches"

What did you do?

The application implements a golang ssh transport that hangs indefinitely at ssh.Dial every so often.
The current timeout is set to 30 seconds, which ssh.Dial does not uphold (https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/ssh.go#L251-L255 https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/init.go#L30).

This is a low concurrency (2-4 parallel workers) application which creates multiple ssh connections to execute simple git operations.

The ssh.Dial uses the ssh.ClientConfig as below:

        ssh.ClientConfig{
		User:    username,
		Auth:    []ssh.AuthMethod{ssh.PublicKeys(key)},
		Timeout: 30 * time.Second,
	}

Actual code can be seen at:
https://github.com/fluxcd/source-controller/blob/main/pkg/git/libgit2/managed/ssh.go#L166

What did you expect to see?

The ssh.Dial operation error if the Dial operation took longer than the pre-configured timeout.

What did you see instead?

The goroutine hangs indefinitely. pprof shows the culprit being:

goroutine 25748 [select, 50 minutes]:
golang.org/x/crypto/ssh.(*handshakeTransport).kexLoop(0xc0003398c0)
	golang.org/x/[email protected]/ssh/handshake.go:268 +0x485
created by golang.org/x/crypto/ssh.newClientTransport
	golang.org/x/[email protected]/ssh/handshake.go:135 +0x23d
@gopherbot gopherbot added this to the Unreleased milestone Mar 24, 2022
@mknyszek
Copy link
Contributor

CC @FiloSottile

@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 24, 2022
@seankhliao seankhliao changed the title x/crypto: ssh.Dial hangs in kexLoop indefinitely - ignoring ClientConfig.Timeout x/crypto/ssh: Dial hangs in kexLoop indefinitely - ignoring ClientConfig.Timeout Mar 24, 2022
@pjbgf
Copy link
Author

pjbgf commented Mar 25, 2022

I have noticed that this seems to happen only on subsequent/concurrent connections. I could not reproduce it ever happening on the first connection. So potentially this is somehow related to #27140.

pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 25, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 25, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 25, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 25, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 25, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
pjbgf pushed a commit to pjbgf/source-controller that referenced this issue Mar 28, 2022
The underlying SSH connections are kept open and are reused
across several SSH sessions. This is due to upstream issues in
which concurrent/parallel SSH connections may lead to instability.

golang/go#51926
golang/go#27140
Signed-off-by: Paulo Gomes <[email protected]>
@pjbgf
Copy link
Author

pjbgf commented May 26, 2022

By ensuring the session's StdoutPipe is serviced quickly, seems to resolve the problem, as mentioned on crypto/ssh comments:

https://github.com/golang/crypto/blob/eb4f295cb31f7fb5d52810411604a2638c9b19a2/ssh/session.go#L553-L558

@mvo5
Copy link
Contributor

mvo5 commented Feb 7, 2023

I ran into the issue that ssh.Dial() hangs forever too and I wonder if something like:

diff --git a/ssh/client.go b/ssh/client.go
index 6fd1994..a1300eb 100644
--- a/ssh/client.go
+++ b/ssh/client.go
@@ -174,6 +174,14 @@ func Dial(network, addr string, config *ClientConfig) (*Client, error) {
        if err != nil {
                return nil, err
        }
+       if config.Timeout > 0 {
+               if err := conn.SetDeadline(time.Now().Add(config.Timeout)); err != nil {
+                       return nil, err
+               }
+               defer func() {
+                       conn.SetDeadline(time.Time{})
+               }()
+       }
        c, chans, reqs, err := NewClientConn(conn, addr, config)
        if err != nil {
                return nil, err
@@ -231,7 +239,7 @@ type ClientConfig struct {
        // any of the CertAlgoXxxx and KeyAlgoXxxx constants.
        HostKeyAlgorithms []string
 
-       // Timeout is the maximum amount of time for the TCP connection to establish.
+       // Timeout is the maximum amount of time to establish the connection.
        //
        // A Timeout of zero means no timeout.
        Timeout time.Duration

is a reasonable way to address it? It fixes the issue for me with the hangs I was seeing.

If there is a chance that upstream is interested I could do a PR with a proper test etc but
of course there are questions as the semantic (slightly) changes and the timeout now encompasses
both the TCP connection and the initial setup plus it's actually two timeouts now so the updated comment
may be misleading.

mvo5 added a commit to mvo5/spread that referenced this issue Feb 7, 2023
This commit adds a thin wrapper around the real ssh.Dial() that
additionally sets a deadline on the underlying connection.

It is needed because a ssh.Dial() can happens right after the
reboot command is issued. The net.Dial() itself is successful but
then then during the ssh session setup the TCP connection ends
because of the reboot. The golang "ssh" package has no concpt of
"ssh -o ServerAliveInterval=10" or simialr so the code will
just hang in a read forever. This was observed running the
spread "cerberus" tests on ubuntu 23.04.

Note that half of the function is just a copy of
golang.org/x/crypto/ssh/client.go:func Dial()
and only the conn.SetDeadline() bits are new.

See also e.g. golang/go#51926 for
various bugreports about the golang "ssh" package and hangs.
nemith added a commit to nemith/netconf that referenced this issue Jun 15, 2023
`ssh.Dial()` took in a context that was used to establish the tcp
connection, however that context doesn't cover the ssh handshake which
can easily block indefinitely. This approximates context support for
ssh.NewClientConn() by having a go routine listen for context
cancellation and closing the connection. We can then check for ctx.Err()
and return that (i.e if the context was canceled).

Note that there is a `Timeout` field in `ssh.ClientConfig` but that also
only covers the TCP connection. See
golang/go#51926

Fixes: #53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

4 participants