Spurious conn busy - bug in QueryRow() ? #635

Open
kmansoft opened this issue Nov 7, 2019 · 21 comments


kmansoft commented Nov 7, 2019

Hello,

I've just started using your library, pretty impressed, esp. with JSON handling.

I'm seeing spurious conn busy errors on my development machine. Once they start, they never stop until I restart the app.

The app uses pgx from basically two modules: one "does something" and stores it into pgsql, and the other is an HTTP API which fetches this data from pgsql. The failures always occur on the HTTP side (which obviously runs in separate goroutines from the first, "does something" part).

How can I track down what's causing these errors? I already checked my code for leaked row sets or transactions; there don't seem to be any.

Second, related to the above. What is the default connection pool size if I create my connection like this?

pgsqlConfig, err := pgx.ParseConfig("postgres://user:pass@localhost:5432/db")
if err != nil {
	...
}

pgsqlConnection, err := pgx.ConnectConfig(context.Background(), pgsqlConfig)
if err != nil {
	...
}

And how can I change it (connection pool size)?


kmansoft commented Nov 7, 2019

More info:

The query that fails for me (the first query executed on a particular code path) is an UPSERT like this:

_, err := model.db.Exec(context.Background(),
	`INSERT INTO device (key, push_tech, push_token)
		VALUES($1, $2, $3) 
	ON CONFLICT (key) DO UPDATE
		SET push_tech = $2, push_token = $3`,
	key, pushTech, pushToken)
if err != nil {
	return err
}

fmt.Printf("Upserted device: %s\n", key)

return nil

I added a query just above that, like this:

rows, err := model.db.Query(context.Background(),
	`SELECT _id, key, push_tech, push_token
	FROM device WHERE key = $1`, key)
if err != nil {
	return err
}
defer rows.Close()

if rows.Next() {
	device := mail.SyncDbDevice{}
	err = rows.Scan(&device.Id, &device.Key, &device.PushTech, &device.PushToken)
	if err != nil {
		return err
	}

	fmt.Printf("Selected device: %d, %s, %s, %s\n", device.Id, device.Key, device.PushTech, device.PushToken)
}

return nil

So it looks like this:

http_handler() {
	query()   // DEBUG
	upsert()
	...
}

Whenever the upsert starts to fail, the query just above it executes just fine, no errors.

I suppose the upsert differs from the select in that it (potentially) modifies the database, while a select does not. Still, I'm not sure how to solve the conn busy error and would appreciate help.


kmansoft commented Nov 12, 2019

I did some debugging, adding trace printouts to PgConn.lock and PgConn.unlock.

I believe there is a bug in Conn.QueryRow: it calls Query behind the scenes, which locks the connection to read the data, but I'm unable to find where QueryRow calls rows.Close(). Normally the pattern, as I understand it, is rows := conn.Query() ... rows.Next() ... rows.Close(), and rows.Close() unlocks the connection.

But QueryRow doesn't seem to Close the rows. Perhaps I've missed it, or perhaps it really is missing.

Please comment if I'm wrong, meanwhile I'll try switching from QueryRow to Query to see if the issue goes away.

Edit: disregard this comment. The simple reproducer below fails with Query() the same way it fails with QueryRow(), i.e. with "conn busy".

kmansoft changed the title from "Spurious conn busy" to "Spurious conn busy - bug in QueryRow() ?" on Nov 12, 2019

kmansoft commented Nov 12, 2019

One more thing re: my earlier comment about Query continuing to work even when the upsert started throwing "conn busy".

As it turns out, Query is missing error propagation when pgConn.lock() fails (with "conn busy").

The call stack is Query -> c.pgConn.ExecPrepared -> pgConn.execExtendedPrefix, where if err := pgConn.lock(); err != nil fails, but the original Query call does not return this error; instead rows.Next silently fails (returns false).
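
As a general point (separate from the missing propagation itself), pgx's Rows has an Err() method, and checking it after the Next loop is the standard way to tell "no more rows" apart from a failed query. A minimal sketch, assuming pgx v4 and an open *pgx.Conn; whether this particular lock failure actually surfaces through Err() in the version discussed here is not something the thread confirms:

// Assumes imports: "context", "time", "github.com/jackc/pgx/v4".
func selectNow(conn *pgx.Conn) (time.Time, error) {
	var now time.Time
	rows, err := conn.Query(context.Background(), "SELECT NOW()")
	if err != nil {
		return now, err // up-front failure
	}
	defer rows.Close()
	for rows.Next() {
		if err := rows.Scan(&now); err != nil {
			return now, err
		}
	}
	// Next() returns false both at end-of-rows and on error;
	// Err() is what tells the two cases apart.
	return now, rows.Err()
}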


kmansoft commented Nov 12, 2019

A simple reproducer:

wg := sync.WaitGroup{}
wg.Add(10)
for coIter := 0; coIter < 10; coIter += 1 {
	go func() {
		for nowIter := 0; nowIter < 100; nowIter += 1 {
			var nowValue time.Time
			nowError := pgsqlConnection.QueryRow(context.Background(),
				"SELECT NOW()").Scan(&nowValue)
			if nowError != nil {
				fmt.Printf("QueryRow returned error %v\n", nowError)
				break
			}
			fmt.Printf("PG now() = %s\n", nowValue)
		}
		wg.Done()
	}()
}
wg.Wait()

You should see QueryRow return the conn busy error; on my machine it happens in about 2 out of 3 runs.

It also can throw a different error:

QueryRow returned error ERROR: bind message has 2 result formats but query has 1 columns (SQLSTATE 08P01)

Maybe something goes wrong in buffer / parameter management when the conn busy error is present.


durango commented Nov 14, 2019

@kmansoft I submitted a PR that should fix your issue. In the meantime, my workaround was to acquire a connection from the pool manually for each query and then release it manually.
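
A sketch of that workaround, assuming pgx v4's pgxpool package and an already-connected pool:

// Assumes imports: "context", "time", "github.com/jackc/pgx/v4/pgxpool".
func queryNow(pool *pgxpool.Pool) (time.Time, error) {
	conn, err := pool.Acquire(context.Background())
	if err != nil {
		return time.Time{}, err
	}
	// While we hold the connection no other goroutine can touch it,
	// so "conn busy" cannot occur on it.
	defer conn.Release()

	var now time.Time
	err = conn.QueryRow(context.Background(), "SELECT NOW()").Scan(&now)
	return now, err
}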


kmansoft commented Nov 14, 2019

Thank you @durango, I'll be looking forward to it getting merged.

In the meantime I'm wondering if I should switch from pgx connection to pgx pool. My simple test passes with pgxpool with no errors.

Edit: no, pgxpool also fails, please see below.

https://godoc.org/github.com/jackc/pgx/pgxpool

But coming from Go's "sql" package, I was expecting pgx to perform pooled connection management behind the scenes. Is this assumption wrong? Should I always use pgxpool in a multi-goroutine setup, or is that not necessary? Really a pgx API design question.

@kmansoft

About pgxpool. It also fails on concurrent workloads.

pgsqlConfig, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/database")
if err != nil {
	fmt.Println(err)
	return
}

pgxPool, err := pgxpool.ConnectConfig(context.Background(), pgsqlConfig)
if err != nil {
	fmt.Println(err)
	return
}

defer func() {
	fmt.Println("Disconnecting from pgsql")
	pgxPool.Close()
}()

count := int64(0)
wg := sync.WaitGroup{}
wg.Add(10)
for coIter := 0; coIter < 20; coIter += 1 {
	go func() {
		for nowIter := 0; nowIter < 1000; nowIter += 1 {
			var nowValue time.Time
			nowError := pgxPool.QueryRow(context.Background(),
				"SELECT NOW()").Scan(&nowValue)
			if nowError != nil {
				fmt.Printf("QueryRow returned error %v\n", nowError)
				os.Exit(1)
			}
			fmt.Printf("PG now() = %s\n", nowValue)
			atomic.AddInt64(&count, 1)
		}
		wg.Done()
	}()
}
wg.Wait()
fmt.Printf("Total queries run: %d\n", count)

Result:

QueryRow returned error closed pool

The trick to triggering it is that the number of goroutines needs to be 20 (10 works with no error).


durango commented Nov 14, 2019

I was using pgxpool with Query and QueryRow and still hitting the same problem as you (I also noticed the 10 vs 20 goroutine difference on my own, so our experiences seem to be very much the same). My fix won't work because the pool gets released too early for QueryRow, unfortunately.


durango commented Nov 14, 2019

@kmansoft my current working solution is to drop down to pgx's stdlib and use stdlib.AcquireConn(*sql.DB) within goroutines (then defer releasing that connection). So far this has been working well in production.
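
A sketch of that approach, assuming pgx v4's stdlib adapter and an *sql.DB opened with the pgx driver (e.g. sql.Open("pgx", dsn)):

// Assumes imports: "context", "database/sql", "time",
// "github.com/jackc/pgx/v4/stdlib".
func withDedicatedConn(db *sql.DB) error {
	conn, err := stdlib.AcquireConn(db) // checks a raw *pgx.Conn out of database/sql's pool
	if err != nil {
		return err
	}
	defer stdlib.ReleaseConn(db, conn) // hand it back when done

	var now time.Time
	return conn.QueryRow(context.Background(), "SELECT NOW()").Scan(&now)
}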

@kmansoft

@durango thanks for the tip on AcquireConn... let's see what @jackc says about the issues with the higher-level APIs...


jackc commented Nov 16, 2019

Sorry, it took me a while to get to this. There are a number of questions that have come up in this issue and I'll try to answer them all in this post.

What is the default connection pool size

4 (https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L15)

And how can I change it (connection pool size)?

Set MaxConns in the pgxpool.Config struct or set it in the connection string with the pool_max_conns argument. (https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L183)
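
A short sketch of both options; the connection string and pool size here are placeholders:

// Assumes imports: "context", "github.com/jackc/pgx/v4/pgxpool".
func newPool() (*pgxpool.Pool, error) {
	// Option 1: in the connection string.
	// return pgxpool.Connect(context.Background(),
	//	"postgres://user:pass@localhost:5432/db?pool_max_conns=10")

	// Option 2: on the parsed config.
	config, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/db")
	if err != nil {
		return nil, err
	}
	config.MaxConns = 10 // MaxConns is an int32 field on pgxpool.Config
	return pgxpool.ConnectConfig(context.Background(), config)
}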

But QueryRow doesn't seem to Close the rows. Perhaps I've missed it, or perhaps it really is missing.

The underlying query started by QueryRow is closed when Scan is called on the returned Row. (https://github.com/jackc/pgx/blob/master/rows.go#L87)

But coming from Go's "sql" package, I was expecting pgx to perform pooled connection management behind the scenes. Is this assumption wrong? Should I always use pgxpool in a multi-goroutine setup, or is that not necessary? Really a pgx API design question.

The pgx package itself is strictly connection based. If you want a pool you need to use pgxpool (or some other synchronization / pooling mechanism). From the docs for pgx.Conn (https://godoc.org/github.com/jackc/pgx#Conn): "Conn is a PostgreSQL connection handle. It is not safe for concurrent usage."

@kmansoft Regarding the failing test case with pgxpool: the reason it is failing is the line wg.Add(10). You tell the wait group to wait for 10 goroutines to complete, but then start 20. So as soon as 10 of the 20 goroutines complete, the main goroutine blocked on wg.Wait() continues, and the defer that closes the pool executes. But there are still 10 goroutines trying to use the now-closed pool. This explains why you are seeing a closed pool error. If you correct that, it works.

As an aside, it would be less error prone to do a wg.Add(1) inside of the loop instead of wg.Add(20) on the outside. Then there is only one place to change the number of workers.
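
A sketch of the test loop with that correction applied, reusing the pgxPool variable from the snippet above:

// Assumes imports: "context", "fmt", "sync", "time".
var wg sync.WaitGroup
for coIter := 0; coIter < 20; coIter++ {
	wg.Add(1) // one Add per goroutine: the count can never drift from the loop bound
	go func() {
		defer wg.Done()
		for nowIter := 0; nowIter < 1000; nowIter++ {
			var nowValue time.Time
			if err := pgxPool.QueryRow(context.Background(),
				"SELECT NOW()").Scan(&nowValue); err != nil {
				fmt.Printf("QueryRow returned error %v\n", err)
				return
			}
		}
	}()
}
wg.Wait() // now genuinely waits for all 20 goroutines before the deferred pool.Close runs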


durango commented Nov 16, 2019

For the record, I used max pool settings this entire time (ranging from 20 to 400). I checked for long-running queries [> 5 minutes]; there were none.

The underlying query started by QueryRow is closed when Scan is called on the returned Row. (https://github.com/jackc/pgx/blob/master/rows.go#L87)

Saw this yesterday actually, forgot to close my PR / was trying to resolve my pgx issues.

Tbh, I haven't really found a working solution for pgx other than using it as a connection driver. I had to drop down to *sql.DB and *sql.Conn. Somewhere down the line, connections were not being released despite my closing/releasing them manually.

Fwiw I was also using v4 of pgx and noticed fewer pooling problems on v3, but that could just be observation bias. I've also noticed fewer problems since I stopped using CopyFrom and inserted manually. No code examples (since I've moved on / don't have a git history of this development), but I promise I wasn't trying anything fancier than what appears in the test files (other than the stdlib.Acquire part, but that came after).

@kmansoft by any chance, are you connecting to a pgbouncer server?


kmansoft commented Nov 16, 2019

@jackc Thank you for pointing out that "plain" pgx is not intended to be used concurrently. My bad.

With wg in the pool test fixed the test now runs without issues. Thanks again.

@durango no, not pgbouncer, just "plain" pg.

@sebnyberg

Ran into the same issue when closing the acquired connection from stdlib but not properly closing pgx.BatchResults after running SendBatch.
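
For context, a sketch of the pitfall being described, assuming pgx v4's Batch API and an open *pgx.Conn; the table and values are illustrative:

// Assumes imports: "context", "github.com/jackc/pgx/v4".
func runBatch(conn *pgx.Conn) error {
	batch := &pgx.Batch{}
	batch.Queue("INSERT INTO device (key) VALUES ($1)", "a")
	batch.Queue("INSERT INTO device (key) VALUES ($1)", "b")

	br := conn.SendBatch(context.Background(), batch)
	// Without this Close the connection stays locked and every later
	// query on it fails with "conn busy".
	defer br.Close()

	for i := 0; i < batch.Len(); i++ {
		if _, err := br.Exec(); err != nil {
			return err
		}
	}
	return nil
}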


Zamiell commented May 16, 2020

This was a good read, thanks.

For anyone coming across this thread from Google, the TLDR here is that in PostgreSQL land you need to use either pgxpool or pgx's database/sql (stdlib) adapter on top of pgx, or else you will get the dreaded "conn busy" error. This is because, as jackc mentions above, the default connection pool size is 4, and simply increasing the pool size is not a fix on its own, because a vanilla pgx connection is only designed to be used by one goroutine at a time.


warent commented Sep 3, 2020

It's embarrassing to admit this, but I want to share why this problem happened to me in case anyone else encounters it. I have an HTTP server and was trying to reuse the same connection for all requests. If you're doing this, be sure to use a single root/main-level pool instead, and acquire a new connection from the pool per incoming HTTP request (or, if you're using GraphQL, a connection per resolver / dataloader).
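
A sketch of that shape, with hypothetical handler and query names, assuming pgx v4's pgxpool:

// Assumes imports: "fmt", "net/http", "github.com/jackc/pgx/v4/pgxpool".
var pool *pgxpool.Pool // created once in main()

func handleDevice(w http.ResponseWriter, r *http.Request) {
	conn, err := pool.Acquire(r.Context())
	if err != nil {
		http.Error(w, err.Error(), http.StatusServiceUnavailable)
		return
	}
	defer conn.Release() // back to the pool when this request ends

	var key string
	if err := conn.QueryRow(r.Context(),
		"SELECT key FROM device LIMIT 1").Scan(&key); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, key)
}

Note that pgxpool's own Query/QueryRow/Exec also acquire and release internally, so an explicit Acquire is mainly useful when several statements must share one connection (a transaction, for instance).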

jacksgt added a commit to jacksgt/u9k that referenced this issue Sep 22, 2020
Using the pgxpool package to create a connection pool is safer than
using a single connection, because the DB functions are concurrently
accessed from multiple goroutines.

jackc/pgx#635 (comment)

holmbuar commented Nov 17, 2020

@warent do you return a pointer to the connection from the single main-level pool connect function, or use a global var in a separate database package?


warent commented Nov 17, 2020

You'll probably want to use a global pool var in a separate database package and create connections from it wherever you need them.

I'm using GraphQL, so my HTTP handler and my pool are in the same file (main.go). When the GraphQL endpoint is called, I put a pointer to the pool in ctx, which is passed to the resolvers, and each resolver then creates its own connection as needed.

Both approaches seem to work fine, but in retrospect a global variable would probably be cleaner.
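
A sketch of the ctx-passing variant, with hypothetical key and resolver names:

// Assumes imports: "context", "time", "github.com/jackc/pgx/v4/pgxpool".
type poolKey struct{}

func withPool(ctx context.Context, p *pgxpool.Pool) context.Context {
	return context.WithValue(ctx, poolKey{}, p)
}

func poolFrom(ctx context.Context) *pgxpool.Pool {
	return ctx.Value(poolKey{}).(*pgxpool.Pool)
}

// Inside a resolver, each call draws its own pooled connection.
func resolveNow(ctx context.Context) (time.Time, error) {
	var now time.Time
	err := poolFrom(ctx).QueryRow(ctx, "SELECT NOW()").Scan(&now)
	return now, err
}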

@bfontaine

Does it make sense to keep this issue open, or can we close it?


ringerc commented Aug 10, 2023

Per my comment #750 (comment), take care before blindly adopting pgbouncer as a proxy or reverse proxy between pgx and postgres. In session-pooling mode it won't help you much, and in transaction-pooling mode it changes the semantics of many postgres operations in ways that will be very important to account for in your app's database usage. Statement-pooling mode is even worse and is best avoided whenever possible.

If pgx works like other drivers in row-at-a-time mode, apps should try to consume the result-set promptly after a query, then close the result-set. The app should defer doing any expensive work (RPCs and other network I/O, nontrivial disk I/O, acquiring contended locks, etc) until after it reads the result and closes the result-set. Doing that should be enough to solve most problems unless the app is highly parallel and database-heavy.
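
A sketch of that read-then-work ordering, assuming pgxpool; doExpensiveWork is a hypothetical stand-in for the RPCs/IO mentioned above:

// Assumes imports: "context", "github.com/jackc/pgx/v4/pgxpool".
func fetchThenProcess(ctx context.Context, pool *pgxpool.Pool) error {
	rows, err := pool.Query(ctx, "SELECT key FROM device")
	if err != nil {
		return err
	}
	var keys []string
	for rows.Next() {
		var k string
		if err := rows.Scan(&k); err != nil {
			rows.Close()
			return err
		}
		keys = append(keys, k)
	}
	rows.Close() // free the connection before any slow work starts
	if err := rows.Err(); err != nil {
		return err
	}
	for _, k := range keys {
		doExpensiveWork(k) // hypothetical: RPCs, disk I/O, contended locks, etc. go here
	}
	return nil
}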

@teplinsky-maxim

If you came here to find out why your connection reports "conn busy": always make sure you call rows.Close() after you have read from the rows.
