Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Properly handle multiple requests to threaded REST API #3974

Merged
merged 2 commits into from
May 11, 2020

Conversation

CasperWA
Copy link
Contributor

@CasperWA CasperWA commented Apr 27, 2020

Fixes #3743

This PR implements the same fix as was used for aiida-optimade, wherein it closes the SQLAlchemy session used by QueryBuilder after having run a method/function to handle a GET request to the REST API.

If this is not done, the session should(?) still eventually close, releasing connections made in the SQLA connection pool. However, this may take such a long time that when a new GET request comes in (or more precisely, whenever the QueryBuilder is used), the newly created thread that handles the request cannot use the singleton session that AiiDA has, since the connections have not been released, and so it fails with the following message:

sqlalchemy.exc.TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30 (Background on this error at: http://sqlalche.me/e/3o7r)

At least that is how I remember the explanation @sphuber and I came up with some time ago. Hence @sphuber also cleaned up the whole session handling code in AiiDA and provided a public interface for closing the given session.

The major issue here is threading; the AiiDA SQLA session (that the QueryBuilder uses no matter the actual profile's database backend) is a singleton, and only meant to be used within the same (main) thread. Since the REST API runs threaded, as it should to accommodate multiple incoming requests effectively, the singleton AiiDA SQLA sessions design breaks down. However, by closing the session at the end of every request, the connections are immediately released and only in extreme cases will the QueuePool ever reach is limit. In fact, one may argue that in those cases, it should complain, since something is probably wrong/you're flooding the database with too many requests.

This issue was never actually tested with the current tests, since a threaded server is never started.
I have added a new test that starts the REST API server in a new Thread, issues 100 requests in serial, and finally shuts the server down.
This resulted in a reproduction of the QueuePool error, which was then fixed after implementing my fix.

A side note: Since the REST API (as far as I know, please back me up here @waychal) only serves data from the database, and never alters or creates data in the database, there is no risk of corrupting the database by bombarding the REST API with requests.

Last thing: @sphuber, should the close_session decorator function be placed somewhere else in the code? I think I would like to perhaps use this directly in aiida-optimade as well, instead of re-implementing it.
Also, would I then have to move the from aiida.manage.manager import get_manager import line into the finally: section of the close_session() function again, where it was originally placed, in order to not unintentionally load the AiiDA profile before it's truly time?

@codecov
Copy link

codecov bot commented Apr 27, 2020

Codecov Report

Merging #3974 into develop will increase coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #3974      +/-   ##
===========================================
+ Coverage    78.50%   78.57%   +0.07%     
===========================================
  Files          461      461              
  Lines        34094    34095       +1     
===========================================
+ Hits         26763    26787      +24     
+ Misses        7331     7308      -23     
Flag Coverage Δ
#django 70.59% <85.19%> (+0.05%) ⬆️
#sqlalchemy 71.44% <81.49%> (+0.04%) ⬆️
Impacted Files Coverage Δ
aiida/manage/tests/unittest_classes.py 0.00% <ø> (ø)
aiida/backends/djsite/__init__.py 100.00% <100.00%> (+33.34%) ⬆️
aiida/backends/djsite/manager.py 82.03% <100.00%> (+2.72%) ⬆️
aiida/backends/manager.py 72.42% <100.00%> (ø)
aiida/backends/sqlalchemy/__init__.py 100.00% <100.00%> (+33.34%) ⬆️
aiida/backends/sqlalchemy/manager.py 78.00% <100.00%> (+2.00%) ⬆️
aiida/restapi/common/identifiers.py 71.52% <100.00%> (ø)
aiida/restapi/common/utils.py 71.70% <100.00%> (+0.46%) ⬆️
aiida/restapi/resources.py 94.56% <100.00%> (-0.15%) ⬇️
aiida/restapi/run_api.py 90.70% <100.00%> (-0.21%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68fabf0...11130a4. Read the comment docs.

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch 2 times, most recently from d3ca10c to 23b1a91 Compare April 27, 2020 10:21
@CasperWA CasperWA added the pr/ready-for-review PR is ready to be reviewed label Apr 27, 2020
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @CasperWA , looking good in general with a few minor things and a design question/suggestion.

Last thing: @sphuber, should the close_session decorator function be placed somewhere else in the code?

Any code path that will cause the QueryBuilder to be constructed by a thread should have this. I have the feeling that the get method would be a good place to catch all of these, but I am not sure. We should check with @waychal and @asle85

Also, would I then have to move the from aiida.manage.manager import get_manager import line into the finally: section of the close_session() function again, where it was originally placed, in order to not unintentionally load the AiiDA profile before it's truly time?

You can import the get_manager function safely without having to have loaded a profile.

aiida/restapi/common/identifiers.py Outdated Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
aiida/restapi/common/utils.py Outdated Show resolved Hide resolved
aiida/restapi/resources.py Outdated Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
@CasperWA
Copy link
Contributor Author

Last thing: @sphuber, should the close_session decorator function be placed somewhere else in the code?

Any code path that will cause the QueryBuilder to be constructed by a thread should have this. I have the feeling that the get method would be a good place to catch all of these, but I am not sure. We should check with @waychal and @asle85

I didn't mean where to use the decorater, I believe this is already the best place where it is now.
I meant more the actual decorator code, i.e., the close_session() function. Should it be moved to aiida.backends.sqlalchemy instead, together with the rest of the session-related code?

Also, would I then have to move the from aiida.manage.manager import get_manager import line into the finally: section of the close_session() function again, where it was originally placed, in order to not unintentionally load the AiiDA profile before it's truly time?

You can import the get_manager function safely without having to have loaded a profile.

Great! Thanks 👍

@sphuber
Copy link
Contributor

sphuber commented Apr 27, 2020

I didn't mean where to use the decorater, I believe this is already the best place where it is now.
I meant more the actual decorator code, i.e., the close_session() function. Should it be moved to aiida.backends.sqlalchemy instead, together with the rest of the session-related code?

Ah I see, no it shouldn't go to aiida.backends.sqlalchemy because it is also need when you run with a django backend. This is why I created the get_session on the Backend class, such that it can be gotten in a backend independent way. Since this is the offical API to dealing with the API, we don't need to add additional code.

@CasperWA
Copy link
Contributor Author

I didn't mean where to use the decorater, I believe this is already the best place where it is now.
I meant more the actual decorator code, i.e., the close_session() function. Should it be moved to aiida.backends.sqlalchemy instead, together with the rest of the session-related code?

Ah I see, no it shouldn't go to aiida.backends.sqlalchemy because it is also need when you run with a django backend. This is why I created the get_session on the Backend class, such that it can be gotten in a backend independent way. Since this is the offical API to dealing with the API, we don't need to add additional code.

I'll try and ask again then: Do you think the close_session() should be moved somewhere else in the code than where it is now? (And if yes, where?) Because I would not consider it an intrinsic part of the REST API, but rather belonging to AiiDA core and its session handling.

@sphuber
Copy link
Contributor

sphuber commented Apr 27, 2020

I'll try and ask again then: Do you think the close_session() should be moved somewhere else in the code than where it is now? (And if yes, where?) Because I would not consider it an intrinsic part of the REST API, but rather belonging to AiiDA core and its session handling.

I have answered this in the PR comments themselves. I suggested defining the decorator on the BaseResource class itself, as it is there where it is used. When I refactored the session handling code, I decided to provide get_session and not just close_session because maybe the caller would need to do have access to the session. Since closing the session is as simple as get_session().close() I didn't feel it was necessary to have an explicit close_session method. Since the context manager you wrote is just one of the many potential ways to deal with the problem at hand, I don't really see why this has to be some standard utility in AiiDA.

@CasperWA
Copy link
Contributor Author

I have answered this in the PR comments themselves. (...)

LazyMan (aka. me ...) strikes again!

LazyMan strikes again!

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch from 23b1a91 to 6d71d4a Compare April 27, 2020 17:26
@CasperWA
Copy link
Contributor Author

With the latest commits I have moved to the pytest way of testing, as well as addressed all other review comments by @sphuber.
Furthermore, an extra commit adds another test - I thought this was relevant to actually validate the result of the "first" test, in the sense that it tests a REST API endpoint/resource without the close_session decorator.
The issue with this test is the default time-out for QueuePool connections (30 s), since it means the test will at minimum always take this long ... I haven't found a way to set the time-out without also getting a setup that doesn't properly reflect how the REST API will actually run.

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch from 6d71d4a to b9f1075 Compare April 27, 2020 17:36
@CasperWA
Copy link
Contributor Author

@sphuber, it seems the fixture aiida_localhost doesn't work as expected?

@ltalirz
Copy link
Member

ltalirz commented Apr 27, 2020

This is likely related to the discussion here.

In essence, I think the issue is rooted in interference between the pytest fixtures and the remaining AiiDATestCase implementation.
In the PR I linked we decided to remove the change again since we wanted to release. I wrote

Note, however, that we will likely run into this issue again soon, namely when people start to use pytest fixtures in the test suite more actively.

Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there @CasperWA thanks a lot. Just a few more minor things

aiida/restapi/resources.py Outdated Show resolved Hide resolved
aiida/restapi/common/utils.py Show resolved Hide resolved
aiida/restapi/common/utils.py Outdated Show resolved Hide resolved
aiida/restapi/resources.py Show resolved Hide resolved
tests/restapi/test_threaded_restapi.py Outdated Show resolved Hide resolved
@CasperWA
Copy link
Contributor Author

This is likely related to the discussion here.

In essence, I think the issue is rooted in interference between the pytest fixtures and the remaining AiiDATestCase implementation.
In the PR I linked we decided to remove the change again since we wanted to release. I wrote

Note, however, that we will likely run into this issue again soon, namely when people start to use pytest fixtures in the test suite more actively.

Well that's crap.

So what now? Should I undo my conversion to pytest or should I make and store a Computer "manually" or what is the solution, do you think?

@CasperWA CasperWA added pr/on-hold PR should not be merged and removed pr/ready-for-review PR is ready to be reviewed labels Apr 28, 2020
@greschd
Copy link
Member

greschd commented Apr 28, 2020

So what now?

You could try using the clear_database_before_test fixture. My understanding is that some of the "old-style" tests are trashing the database (specifically here, the user). The "good" solution here would be finding and fixing those tests, but the workaround is to reset the database before running "new-style" tests.

Should I undo my conversion to pytest

Please don't 😅

aiida/restapi/resources.py Outdated Show resolved Hide resolved
@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch from f9336c0 to a86e84f Compare April 28, 2020 13:53
@sphuber
Copy link
Contributor

sphuber commented Apr 29, 2020

But isn't this a bit too hacky?

Not at all. There is a reason that SqlAlchemy provides a way to tweak these parameters. We just haven't exposed the same interface in AiiDA because so far it had not been necessary yet. It is fine to add this for the purposes of the tests. It would of course also open it up to users changing, but the get_scoped_session is not part of the public API so this should serve as enough of a warning.

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch 2 times, most recently from 6b279ca to 5691668 Compare April 30, 2020 08:53
@CasperWA
Copy link
Contributor Author

CasperWA commented Apr 30, 2020

All right!
It seems that this is now (finally) working "reliably" for Py3.8 (edit: run a couple of times now).
However, for Py3.5, shutting down the server, doesn't seem to close the Thread.

Also, I am unsure whether it was the switch to check the capfd instead of capsys or have a small sleep period before shutting down the server that made sure the correct output could be found in the traceback thrown for the test that doesn't apply the fix.
I want to test this by switching back to using capsys, run the tests again, and then I guess I'll have to setup a local Py3.5 environment to adapt the tests to work with older py3 version. NOTE: The fix will work for all supported versions, this is merely an issue with the test for reproducing the original error, i.e., I suggest (based on an offline comment by @ltalirz) to eventually skip the test (perhaps just for Py3.5, or all-together) and merge the PR, since the fix is somewhat critical.

Finally, @sphuber, please review specifically whether my trailing of **kwargs through the code base to end up being passed to sqlalchemy.create_engine fits with your intended pathway. Note, I wanted to make it part of a public function, since this seems to be the best practice, right?

@CasperWA
Copy link
Contributor Author

CasperWA commented May 1, 2020

Update:
I have set up a local Py3.5 environment and could reproduce the never-ending tests also seen here in the GH Actions CI.
Doing a bit of research I found out that the socket was stuck in an infinite loop trying to read.
I then found this and simply setting a timeout together with the get requests, seemed to work and get the socket out of the loop, and catching the subsequent exception (and passing on it) made the tests run locally.

Now to see if this is also the case with GH Actions ...

@CasperWA
Copy link
Contributor Author

CasperWA commented May 1, 2020

https://github.com/aiidateam/aiida-core/runs/636184358

🤩 ❤️ 😍 🎉 🥳

@CasperWA CasperWA requested review from sphuber and greschd May 1, 2020 11:29
@CasperWA
Copy link
Contributor Author

CasperWA commented May 1, 2020

https://github.com/aiidateam/aiida-core/runs/636184358

🤩 ❤️ 😍 🎉 🥳

However, it does seem that the whole traceback sometimes still doesn't get registered and captured in capfd: https://github.com/CasperWA/aiida_core/runs/636183960

@CasperWA CasperWA added pr/ready-for-review PR is ready to be reviewed and removed pr/on-hold PR should not be merged labels May 1, 2020
Copy link
Contributor

@sphuber sphuber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @CasperWA . Last thing I would ask is to adapt the docstrings as indicated. Once you have done that, please create a first separate commit that introduces the new kwargs piping and then the second commit with the actual fix and tests on top, then I will merge it.

aiida/backends/djsite/__init__.py Outdated Show resolved Hide resolved
@CasperWA CasperWA removed the pr/ready-for-review PR is ready to be reviewed label May 7, 2020
@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch from b711b1b to e780e3d Compare May 7, 2020 09:20
@CasperWA CasperWA added the pr/ready-for-review PR is ready to be reviewed label May 7, 2020
@CasperWA CasperWA requested a review from sphuber May 7, 2020 09:20
@CasperWA
Copy link
Contributor Author

CasperWA commented May 7, 2020

@sphuber Wait a second with merging - I just realized I should use our "new" get_decorators argument when initiating the mock REST API class in one of my tests, instead of also creating a mock resource.
This way the get_decorators argument will be tested also.

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch 2 times, most recently from 95efcdd to acfe8ee Compare May 7, 2020 10:05
@greschd greschd removed their request for review May 7, 2020 10:10
@CasperWA
Copy link
Contributor Author

CasperWA commented May 7, 2020

@sphuber All right, it seems to be all fixed with docs building etc. now. Also, I have updated the test and it seems to work.
It was a good thing as well, I found out the actual way to pass this kwarg to the Resource.
Aaaand out-of-date ....

@CasperWA CasperWA force-pushed the fix_3743_handle_threads_rest_api branch from acfe8ee to e502170 Compare May 7, 2020 10:14
The function `aiida.backends.utils.create_sqlalchemy_engine` now takes
keyword arguments that are passed straight to the SQLAlchemy function
`create_engine` which creates the engine that connects to the database.
This utility function is used by the SQLAlchemy and by both backends
for the `QueryBuilder`. By allowing these keyword arguments to be passed
and plumbing them all the way in to the `Backend` class, the parameters
of the SQLAlchemy engine and the associated queue pool can be
customized. This is not meant to be used by end users, but can be
important for important use, for testing and performance reasons.
This is needed due to the server running in threaded mode, i.e.,
creating a new thread for each incoming request. This concept is great
for handling many requests, but crashes when used together with AiiDA's
global singleton SQLA session used, no matter the backend of the profile
by the `QueryBuilder`.

Specifically, this leads to issues with the SQLA QueuePool, since the
connections are not properly released when a thread is closed. This
leads to unintended QueuePool overflow.

This fix wraps all HTTP method requests and makes sure to close the
current thread's SQLA session after the request as been completely
handled.

Use Flask-RESTful's integrated `Resource` attribute `method_decorators`
to apply `close_session` wrapper to all and any HTTP request that may
be requested of AiiDA's `BaseResource` (and its sub-classes).

Additionally, remove the `__init__` function overwritten in
`Node(BaseResource)`, since it is redundant, and the attributes `tclass`
is not relevant with v4 (AiiDA v1.0.0 and above), but was never removed.
It should have been removed when moving to v4 in 4ff2829.

Concerning the added tests: the timeout needs to be set for Python 3.5
in order to stop the http socket and properly raise (and escape out of
an infinite loop). The `capfd` fixture must be used, otherwise the
exception cannot be properly captured.

The tests were simplified into the pytest scheme with ideas from
@sphuber and @greschd.
@sphuber sphuber force-pushed the fix_3743_handle_threads_rest_api branch from e502170 to 11130a4 Compare May 11, 2020 09:54
@sphuber sphuber merged commit b9d4bbe into aiidateam:develop May 11, 2020
@sphuber sphuber deleted the fix_3743_handle_threads_rest_api branch May 11, 2020 10:11
@sphuber
Copy link
Contributor

sphuber commented May 11, 2020

Thanks a lot @CasperWA !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr/ready-for-review PR is ready to be reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make sure threads are handled correctly for the REST API
4 participants