More robust listener close in `riverpgxv5` #246

brandur · 2024-03-01T14:48:45Z

As part of #239, we're observing the possibility of the notifier going
into a hot loop where it's trying to reopen a listener connection, but
the listener won't let it because it thinks the connection is already
open.

A suspect is the listener's Close implementation, which in the event
of an error, returns the error and fails to release an underlying
connection, putting it into a state where it's never reusable.

Here, modify Close so that it always releases and unsets an underlying
connection regardless of the error state returned.
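
For illustration, here is a minimal sketch of the always-release pattern described above. It is not the driver's actual code: `releasableConn`, `listener`, `fakeConn`, and `closeWithFailingConn` are hypothetical stand-ins so the example runs without pgx, and the real change in `river_pgx_v5_driver.go` may differ in detail.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// releasableConn is a hypothetical stand-in for a pooled connection.
// It is not a pgx type; it only models "close" and "release back to pool".
type releasableConn interface {
	Close(ctx context.Context) error
	Release()
}

// listener is a simplified, hypothetical version of a driver listener.
type listener struct {
	mu   sync.Mutex
	conn releasableConn
}

// Close always releases and unsets the connection, even when closing it
// fails, so a later reconnect is never blocked by a stale conn.
func (l *listener) Close(ctx context.Context) error {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.conn == nil {
		return nil
	}

	// Runs after the return value below is computed, regardless of error.
	defer func() {
		l.conn.Release()
		l.conn = nil
	}()

	return l.conn.Close(ctx)
}

// fakeConn lets the example force an error from Close.
type fakeConn struct {
	closeErr error
	released bool
}

func (c *fakeConn) Close(ctx context.Context) error { return c.closeErr }
func (c *fakeConn) Release()                        { c.released = true }

// closeWithFailingConn closes a listener whose conn fails on Close and
// reports whether the conn was still released and unset.
func closeWithFailingConn() (released, unset bool, err error) {
	conn := &fakeConn{closeErr: errors.New("simulated close failure")}
	l := &listener{conn: conn}
	err = l.Close(context.Background())
	return conn.released, l.conn == nil, err
}

func main() {
	released, unset, err := closeWithFailingConn()
	fmt.Printf("released=%t unset=%t err=%v\n", released, unset, err)
}
```

The point of the deferred cleanup is that the error is still surfaced to the caller, but the listener ends up in a reusable state either way.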

I tried to add a test case for this, but reading through pgx and net
source code, I couldn't find any way to simulate an error from Close
(I thought a cancelled context would do it, but it does not), so I had
to leave it.

Thanks @mfrister for pointing this one out!

Fixes #248.

brandur · 2024-03-01T14:52:44Z

@bgentry Seriously tried to add a test case for this, but couldn't find a way to make Close on the PG conn return an error. The only way to do it I think would be to add additional test-only interface functions to inject state, which would be kind of gross. Thoughts?

mfrister · 2024-03-01T15:20:41Z

Thanks for the fix!

I've deployed a version of my worker with riverpgxv5 from this branch and will report on Monday whether the issue appeared again. So far it looks good.

Enjoy your weekend!

Rshep3087 · 2024-03-01T15:31:06Z

There is this package from pgx: https://github.com/jackc/pgx/blob/master/pgxtest/pgxtest.go

You can overwrite the CloseFunc on the test runner, I believe.

brandur · 2024-03-01T15:41:28Z

@mfrister Excellent. Thanks!

@Rshep3087 Hm, just reading through this module, I believe ConnTestRunner is more of a helper for controlling the flow of test runs rather than something that can act as a mock for a pgx.Conn.

bgentry

The only ways I can think of to test this are (a) use an interface over the conn here, which feels excessive, or (b) somehow force the server to uncleanly terminate the conn or reply with an error upon close.

I'm fine with not going forward with one of those approaches at this time; they seem too heavy for the benefit at this stage. Thanks for the quick fix.

bgentry · 2024-03-02T02:11:40Z

riverdriver/riverpgxv5/river_pgx_v5_driver.go

-	if err := l.conn.Conn().Close(ctx); err != nil {
-		return err
-	}


Man, I noticed this change when I reviewed the mega-PR, should have said something 😞

brandur

Doh. Well, I was going to say "next time", but hopefully nothing like that PR ever happens again :)

brandur · 2024-03-02T04:20:38Z

> The only ways I can think of to test this are (a) use an interface over the conn here, which feels excessive, or (b) somehow force the server to uncleanly terminate the conn or reply with an error upon close.

Yeah, I had the same reactions.

I think (a) is plausible, but has the downside that it makes the code a little more painful to use. Rather than being able to jump easily into concrete implementations, you now have to jump through the interface, which is always a bit annoying.

And RE: (b), I read through a lot of pgx source around Close, and man, as far as I can tell, it is quite difficult to ever get it to return an error. The underlying net.Conn (under the pgx connection, which is itself under the pgx pool connection) will in some cases, but I couldn't find an obvious way to trigger it, especially from this level.

> I'm fine with not going forward with one of those approaches at this time; they seem too heavy for the benefit at this stage. Thanks for the quick fix.

Sweet. Yeah, that was my conclusion as well. Possible maybe to add some form of testing, but maybe more trouble than it's worth, and getting the fix out the door is the best compromise for the time being.

Thanks!

dhermes · 2024-03-02T17:24:12Z

Regarding unit testing, I have a little bit of experience here. I hunted down a bug in lib/pq a few years back and ended up replicating the bug locally while crafting a fix by using a TCP proxy that would send an RST packet after some signal had been received (but before the actual PostgreSQL connection was closed).

From what I understand, your unit tests hit a REAL PostgreSQL instance, so you could run a proxy within a Go unit test and put that proxy in front of the PostgreSQL instance available to the unit tests.
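
A rough sketch of what such an in-test proxy could look like, under several assumptions: it is not the proxy used for the lib/pq bug, it speaks plain TCP only (no TLS), and it puts a throwaway echo server where the PostgreSQL instance would be so the example runs on its own. The names `resetProxy`, `startEcho`, and `readAfterReset` are made up for illustration. The abort relies on `SetLinger(0)`, which on most platforms makes `Close` send an RST rather than a graceful FIN.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"time"
)

// startEcho runs a one-connection echo server standing in for PostgreSQL.
func startEcho() (net.Listener, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		io.Copy(conn, conn)
	}()
	return ln, nil
}

// resetProxy forwards a single TCP connection to target until reset is
// closed, then aborts the client side of the connection.
type resetProxy struct {
	ln     net.Listener
	target string
	reset  chan struct{}
}

func newResetProxy(target string) (*resetProxy, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, err
	}
	p := &resetProxy{ln: ln, target: target, reset: make(chan struct{})}
	go p.serve()
	return p, nil
}

func (p *resetProxy) serve() {
	client, err := p.ln.Accept()
	if err != nil {
		return
	}
	backend, err := net.Dial("tcp", p.target)
	if err != nil {
		client.Close()
		return
	}
	go io.Copy(backend, client) // client -> backend
	go io.Copy(client, backend) // backend -> client

	<-p.reset
	// With a linger of 0, Close discards unsent data and, on most
	// platforms, sends an RST instead of a graceful FIN.
	if tcp, ok := client.(*net.TCPConn); ok {
		_ = tcp.SetLinger(0)
	}
	client.Close()
	backend.Close()
}

// readAfterReset round-trips one message through the proxy, triggers the
// reset, and returns the error seen by the next client read.
func readAfterReset() (readErr, setupErr error) {
	backend, err := startEcho()
	if err != nil {
		return nil, err
	}
	defer backend.Close()

	proxy, err := newResetProxy(backend.Addr().String())
	if err != nil {
		return nil, err
	}
	defer proxy.ln.Close()

	conn, err := net.Dial("tcp", proxy.ln.Addr().String())
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	_ = conn.SetDeadline(time.Now().Add(5 * time.Second))

	// Confirm the proxy forwards traffic before breaking it.
	if _, err := conn.Write([]byte("ping")); err != nil {
		return nil, err
	}
	buf := make([]byte, 4)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return nil, err
	}

	close(proxy.reset) // the "signal" that triggers the abort
	_, readErr = conn.Read(buf)
	return readErr, nil
}

func main() {
	readErr, setupErr := readAfterReset()
	fmt.Println("setup error:", setupErr)
	fmt.Println("read error after reset:", readErr)
}
```

Whether an error produced this way would actually propagate out of pgx's `Close` is a separate question that this sketch does not answer.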

🤞 I'd hope a plain old TCP error over a non-TLS connection would do the trick, but maybe not. In my log capture around the issue that this fixed (#248), it was actually a TLS protocol issue that showed up in the error:

{
    "level": "error",
    "time": "2024-03-01T23:56:08.679126682Z",
    "notifier": {
        "err": {
            "error": "tls: failed to send closeNotify alert (but connection was closed anyway): write tcp 10.122.48.181:44034->10.122.30.240:5432: i/o timeout",
            "kind": "*fmt.wrapError",
            "stack": null
        }
    },
    "subsystem": "river",
    "message": "error closing listener"
}

However, if I had to guess I'd imagine the TCP connection is what had already been closed (e.g. via RST).

brandur · 2024-03-02T20:45:37Z

@dhermes Thanks for the added detail there! I'm sort of hoping to avoid a full-blown proxy for the time being if we can possibly do it (looked at yours — looks good, but definitely a substantial amount of new code). You gave me the idea though of getting something more like an "in code" proxy going, and I found a fairly low touch way of simulating a Close error using a combination of pgxpool's DialFunc and a lightweight net.Conn mock. See #250.

type connStub struct {
	net.Conn

	closeFunc func() error
}

func newConnStub(conn net.Conn) *connStub {
	return &connStub{
		Conn: conn,

		closeFunc: conn.Close,
	}
}

func (c *connStub) Close() error {
	return c.closeFunc()
}

		var connStub *connStub

		config := testPoolConfig()
		config.ConnConfig.DialFunc = func(ctx context.Context, network, addr string) (net.Conn, error) {
			// Dialer settings come from pgx's default internal one (not exported unfortunately).
			conn, err := (&net.Dialer{KeepAlive: 5 * time.Minute}).Dial(network, addr)
			if err != nil {
				return nil, err
			}

			connStub = newConnStub(conn)
			return connStub, nil
		}
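
As a sketch of how the stub above might then be driven to simulate a failing `Close`: the example below swaps in `closeFunc` after construction. It assumes `net.Pipe` in place of a real dialed connection so it runs without a database, and `errSimulatedClose` and `simulateCloseError` are illustrative names rather than anything from #250.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// connStub wraps a net.Conn and lets a test replace what Close does.
type connStub struct {
	net.Conn

	closeFunc func() error
}

func newConnStub(conn net.Conn) *connStub {
	return &connStub{
		Conn: conn,

		closeFunc: conn.Close,
	}
}

func (c *connStub) Close() error {
	return c.closeFunc()
}

// errSimulatedClose is a hypothetical sentinel used only by this example.
var errSimulatedClose = errors.New("simulated close error")

// simulateCloseError overrides closeFunc so that Close reports a failure,
// then calls Close through the net.Conn interface as a caller would.
func simulateCloseError() error {
	client, server := net.Pipe()
	defer server.Close()

	stub := newConnStub(client)
	stub.closeFunc = func() error {
		// Still close the real connection so nothing leaks, but report
		// a failure to the caller.
		_ = client.Close()
		return errSimulatedClose
	}

	var conn net.Conn = stub
	return conn.Close()
}

func main() {
	fmt.Println(simulateCloseError())
}
```

In a test against a real pool, the same override would presumably be applied to the stub captured inside `DialFunc` before invoking the listener's `Close`.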

Here, follow up #246 by adding a few more tests that verify a listener's state after `Close` has been invoked, including if it returned an error, which we're able to simulate by overriding pgx's `DialFunc` and returning a stand-in stub for an underlying `net.Conn`. Also, remove the explicit `Close` call on an underlying connection in favor of just invoking the pool's `Release` function. In case of an error condition, `Release` will detect that and do the right thing, and pgx is better tested/vetted to make sure that right thing happens.

mfrister · 2024-03-04T06:16:48Z

As promised above:

> I've deployed a version of my worker with riverpgxv5 from this branch and will report on Monday whether the issue appeared again. So far it looks good.

The problem hasn't reappeared on my worker over the weekend, as expected.

brandur · 2024-03-04T06:39:03Z

@mfrister That's great to hear!! Thanks for checking in about it.

brandur mentioned this pull request Mar 1, 2024

Memory leak? #239

Closed

brandur requested a review from bgentry March 1, 2024 14:51

brandur force-pushed the brandur-robust-listener-close branch from 8ba4e4c to f566e2c Compare March 1, 2024 15:31

brandur force-pushed the brandur-robust-listener-close branch 2 times, most recently from cc2c92e to 45d7f15 Compare March 1, 2024 15:51

brandur force-pushed the brandur-robust-listener-close branch from 45d7f15 to 02267da Compare March 1, 2024 15:52

bgentry mentioned this pull request Mar 2, 2024

Possible connection re-use issue in 0.0.24 #248

Closed

bgentry approved these changes Mar 2, 2024

brandur merged commit cf178f7 into master Mar 2, 2024
8 of 10 checks passed

brandur deleted the brandur-robust-listener-close branch March 2, 2024 04:20

brandur added a commit that referenced this pull request Mar 2, 2024

Tee up version v0.0.24

2896b65

Contains an important fix from #246 which resolves a problem wherein a listener from the `riverpgxv5` driver couldn't be reused in cases where `Close` returned an error.

brandur mentioned this pull request Mar 2, 2024

Tee up version v0.0.25 #249

Merged

brandur added a commit that referenced this pull request Mar 2, 2024

Tee up version v0.0.25

43d199c

brandur added a commit that referenced this pull request Mar 2, 2024

Tee up version v0.0.25 (#249)

6b03de1

brandur mentioned this pull request Mar 2, 2024

Add tests for listener Close + invoke only pool release #250

Merged

upalsaha mentioned this pull request Mar 8, 2024

Intermittent listener close issue popping up again #256

Closed
