Python exceptions raised in a wait_callback do not always propagate #410

Closed
dhagrow opened this issue Feb 16, 2016 · 17 comments

@dhagrow commented Feb 16, 2016

I haven't been able to put together a minimal example of this, so unfortunately the only place I can get the problem to manifest is within the scope of a project I'm working on. I will keep trying to produce an example, but in the meantime the best I can offer is some context.

The project is using SQLAlchemy on top of psycopg2, using psycogreen to provide this gevent wait_callback:

# Imports needed to run the callback standalone (as in psycogreen.gevent):
import psycopg2
from psycopg2 import extensions
from gevent.socket import wait_read, wait_write

def gevent_wait_callback(conn, timeout=None):
    """A wait callback useful to allow gevent to work with Psycopg."""
    while 1:
        state = conn.poll()
        if state == extensions.POLL_OK:
            break
        elif state == extensions.POLL_READ:
            wait_read(conn.fileno(), timeout=timeout)
        elif state == extensions.POLL_WRITE:
            wait_write(conn.fileno(), timeout=timeout)
        else:
            raise psycopg2.OperationalError(
                "Bad result from poll: %r" % state)

The problem I'm seeing is that an exception raised within this callback is sometimes not propagated to the Python code that calls psycopg2 via SQLAlchemy. Instead, I get an OperationalError('unknown error').

To be specific, I am killing greenlets, which raises a GreenletExit exception in the greenlet when its execution resumes. In some cases, GreenletExit gets raised inside the wait_callback. Usually, GreenletExit is then also raised in the code that makes the SQLAlchemy call (e.g. session.query(Model).filter().one()). The problem is that sometimes I get the OperationalError instead.

I've toyed around in the C layer of psycopg2, and I can see that PyErr_Occurred() returns GreenletExit after the callback is called in green.c:psyco_wait(). In some cases that error indicator somehow gets lost before the C layer returns control to Python, and execution ends up in pqpath.c:pq_complete_error(), which raises the OperationalError('unknown error').

This seems to me like a bug along some specific code path, but I haven't been able to trace exactly where it goes astray and the exception gets lost.
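
To illustrate the kill mechanism outside of psycopg2 and SQLAlchemy, here is a minimal gevent-only sketch (not the project code, just the general behaviour): kill() raises GreenletExit at whatever wait point the target greenlet is blocked on, which in this setup is the wait_read()/wait_write() call inside the callback.

import gevent

def waiter():
    try:
        gevent.sleep(10)  # stand-in for the wait_read()/wait_write() inside the callback
    except gevent.GreenletExit:
        print("GreenletExit delivered at the wait point")
        raise  # re-raising lets the greenlet finish dying normally

g = gevent.spawn(waiter)
gevent.sleep(0.1)  # let the greenlet reach its wait point
g.kill()           # throws GreenletExit into the blocked greenlet
g.join()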

@dvarrazzo (Member)

Hi,

Can you please run your tests with the current master or maint_2_6 branch? I've done some cleanup of error propagation to fix #281, so it may have changed something relevant to what you are investigating.

If you manage to put together a test rig that reproduces the issue, let us know.

@dhagrow (Author) commented Feb 18, 2016

No, I'm afraid neither branch seems to solve the issue.

Still no luck with a test script, but I'll post one when I can.

@underyx commented Apr 3, 2017

Seems like we're still bumping into the same issue with

gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.6.2
SQLAlchemy==1.1.6

@underyx commented Apr 4, 2017

We've tried bumping the versions of all these dependencies and still had the issue with

gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.7.1
SQLAlchemy==1.1.8

@dvarrazzo (Member)

@underyx I don't have a way to reproduce this issue; I am still waiting for one.

@underyx commented Apr 4, 2017

@dvarrazzo Right, I was working on that, and I came up with a rather silly script that does reproduce the issue, though it depends a bit on random chance.

Install the following packages: pip install gevent==1.2.1 greenlet==0.4.12 psycogreen==1.0 psycopg2==2.7.1 SQLAlchemy==1.1.8

Run this code:

from gevent import monkey; monkey.patch_all()
from psycogreen.gevent import patch_psycopg; patch_psycopg()

from gevent import spawn
from gevent.pool import Pool
import psycopg2
from sqlalchemy import create_engine
import time

engine = create_engine('postgresql://localhost')


def sleepy():
    # Run a slow query so the greenlet is likely to be blocked in the
    # wait callback when it gets killed.
    try:
        conn = engine.raw_connection()
        cur = conn.cursor()
        cur.execute('SELECT pg_sleep(0.1)')
        cur.close()
        conn.close()
    except:
        import traceback; traceback.print_exc()


def killy(greenlets):
    # Kill each greenlet shortly after it starts, raising GreenletExit
    # at whatever point it is currently waiting.
    for greenlet in greenlets:
        time.sleep(0.001)
        greenlet.kill()


def main():
    pool = Pool()
    for _ in range(100):
        greenlets = [pool.apply_async(sleepy) for _ in range(100)]
        spawn(killy, greenlets)
        pool.join()


if __name__ == '__main__':
    main()

Out of 10000 queries this results in around 80 unknown errors for me. Use python repro.py 2>&1 | grep unknown -C 10 to filter out all the other exceptions.
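
As a hypothetical refinement (not part of the script above): instead of grepping stderr, sleepy() could catch GreenletExit and OperationalError separately and count the masked errors directly, reusing the engine defined above.

import psycopg2
from gevent import GreenletExit

unknown_errors = 0

def sleepy_counting():
    # Hypothetical variant of sleepy() that counts the masked errors itself.
    global unknown_errors
    try:
        conn = engine.raw_connection()   # `engine` from the script above
        cur = conn.cursor()
        cur.execute('SELECT pg_sleep(0.1)')
        cur.close()
        conn.close()
    except GreenletExit:
        pass  # the expected outcome of greenlet.kill()
    except psycopg2.OperationalError as exc:
        if 'unknown error' in str(exc):
            unknown_errors += 1  # the masking described in this issue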

@dvarrazzo (Member)

@underyx I'll give it a run, thank you very much.

@dvarrazzo (Member)

@underyx I've tried running your script, then I've tried running it with 10000 instead of 100 in the outer loop. Twice. So it ran >2M queries, and I didn't see a single unknown error :(

Maybe a source of difference is the libpq version? I've tried mine with 9.6.1:

$ pip freeze
SQLAlchemy==1.1.8
argparse==1.2.1
gevent==1.2.1
greenlet==0.4.12
psycogreen==1.0
psycopg2==2.7.1
wsgiref==0.1.2

$ python -c 'import psycopg2; print psycopg2.extensions.libpq_version()'
90601

$ time python test-410.py 2>&1 | grep unknown -C 10

real	23m16.487s
user	5m28.916s
sys	1m23.760s

For completeness, I ran my test with Python 2.7 on Ubuntu 16.04, 64 bits. I tested against 2.7.1 rather than master to check whether this is a case of #539, but even on 2.7.1 I don't seem able to reproduce it.

@underyx commented Apr 4, 2017

That's fascinating. The three environments where I've seen this error are:

  • my dev machine with macOS 10.12.4, python 3.5, libpq 90504
  • our production instance with Ubuntu 16.04, python 2.7, libpq 90602
  • the below Docker image with Alpine 3.4, python 3.6, libpq 90506

I've published this repo with a Dockerfile and a docker-compose.yml you can use to launch a database and run the script in a way that triggers the failure. I got 500 unknown errors with it out of 10000 queries.

PS: I'm sure you didn't make this mistake, but just in case: if time there resolves to the binary instead of the shell keyword, grep may never get a chance to see the script's output.

PPS: I timed my script execution and it came out to 20 seconds for 100 loops. That'd be around 33 minutes for 10000 loops, which is noticeably more than it took for you. Since the timing is important here (GreenletExit needs to be raised at just the right moment), maybe if your connection to localhost or something is faster than mine, you wouldn't see the errors.

@dvarrazzo (Member)

A Docker setup I could test against is exactly what I would have asked you for: thank you very much for providing that. I appreciate the effort and will try to look into it again.

I only used time the second time I ran the script, to get a ballpark measure of how long it takes, not the first time, so I don't think that was the case. It is likely that the difference comes down to timing: my setup seems similar to your production one, in which case the Docker test may fail too... I'll try to run it on my laptops, on my Arduino...

@dvarrazzo (Member)

maybe if your connection to localhost or something

Ok, that got me thinking about differences between network and Unix-socket connections... I tested more carefully, and my run of the script was just failing to connect because of a missing password :\

Fixed pg_hba.conf and now the script spits out a handful of unknown errors in the 10K run too.

I'll look a bit more closely to see if I can understand what causes them.

dvarrazzo added this to the psycopg 2.7.2 milestone Apr 4, 2017
@underyx commented Apr 4, 2017

the script was just failing to connect because of a missing password

At least I'll be able to sleep properly tonight! :D

dvarrazzo added a commit that referenced this issue Apr 5, 2017
The 'unknown error' happens on query.
@dvarrazzo (Member)

Ok, nailed it. I managed to set up a test that doesn't rely on concurrency and added it to the test suite.
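
The idea, in a simplified sketch (not the exact code added to the suite): register a wait callback that raises an ordinary Python exception mid-query and check that this exception, rather than OperationalError('unknown error'), is what reaches the caller.

import psycopg2
import psycopg2.extensions
import psycopg2.extras

class CallbackError(Exception):
    pass

def exploding_callback(conn, timeout=None):
    # Fail immediately, roughly the way GreenletExit interrupts the gevent callback.
    raise CallbackError("boom in the wait callback")

# Connect in "green" mode using the stock select-based callback...
psycopg2.extensions.set_wait_callback(psycopg2.extras.wait_select)
conn = psycopg2.connect("dbname=test")  # hypothetical DSN

# ...then swap in the failing callback and run a query.
psycopg2.extensions.set_wait_callback(exploding_callback)
cur = conn.cursor()
try:
    cur.execute("select 1")
except CallbackError:
    print("callback exception propagated")   # expected behaviour with the fix
except psycopg2.OperationalError as exc:
    print("exception was masked:", exc)      # the bug: 'unknown error'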

dvarrazzo added a commit that referenced this issue Apr 5, 2017
The 'unknown error' happens on query.
@underyx commented Apr 5, 2017

Wow, that's amazing, thank you! We'll test this commit in our production environment to confirm that the bug is fixed.

@dvarrazzo (Member)

That would be great: let us know, thank you very much.

@underyx commented Apr 5, 2017

Yesterday we got 66 of these errors over the span of around an hour. We just ran for 40 minutes with psycopg2 at 4b4d279 and we got

  • 0 unknown errors
  • 55 of a different exception (which must be what was masked yesterday)

So, I can confirm that this issue is fixed! Thanks again 🙃

@dvarrazzo (Member)

Perfect, thank you for testing!
