Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving upserts #79

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

jorisgillis
Copy link

In my setup I ran into problems with updates of notebooks. Because the insert failed, the transaction ended up in a weird state. As a consequence the update was not executed.
This PR changes the code to first check whether a file exists or not. If it does not exist, an insert is performed. If the file does exist an update query is executed.

Also added a paragraph to the README, to explain why some tables can become "invisible" when inspecting the database, due to the table being in a different namespace.

Joris Gillis added 2 commits October 13, 2021 14:03
Adding paragraph on schema's in PostgreSQL.
First, a query is executed to check whether the file
already exists. If not, it is inserted. If it does exist
the file is updated.

The current code was given issues because the transaction
was closed after the failure.

In the future it might be better to use a SQLAlchemy Session
which handles upserts under the hood.
Copy link
Contributor

@ssanderson ssanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @jorisgillis. Thanks for the PR! I'm not sure that the updated strategy here is the right way to handle conflicts on updates. Do you have any more info on the bad state that your setup got into?

@@ -510,7 +510,7 @@ def save_file(db, user_id, path, content, encrypt_func, max_size_bytes):
)
directory, name = split_api_filepath(path)
with db.begin_nested() as savepoint:
try:
if not file_exists(db, user_id, path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about handling this by checking file_exists before executing the insert, since that allows for a possible race condition where another transaction inserts a file in between the exists check and the insert. That worry was the reason that this operation is currently implemented as "try to insert, but handle unique violations if they occur". Do you have any more info on your transaction that ended up in weird state?

Also, since this code was written, postgresql and sqlalchemy have added support for proper upserts directly in sql (see https://docs.sqlalchemy.org/en/14/dialects/postgresql.html#insert-on-conflict-upsert, for instance). Using that would probably be the "right" way to fix this if there's a problem with the current error handling strategy.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm new to SQLALchemy, so forgive my ignorance.

I was under the assumption that the select and insert/update would happen in a single transaction, because of the db.begin_nested() in the with statement. Hence, if two transaction are trying to save (a new notebook) simultaneously, one would win, the other would need to retry a save and get updated in the second attempt.

I get the following exceptions when using pgcontents version 0.6.0:

First, the insert fails

psycopg2.errors.UniqueViolation: duplicate key value violates unique constraint "uix_filepath_username"
DETAIL:  Key (user_id, parent_name, name)=(my_awesome_username, /, Untitled.ipynb) already exists.

Then

sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.

If I understand it correctly, that means that the transaction started by the db.begin_nested() has failed in some way and a new transaction needs to be started?

On the upsert: very interesting. My initial attempt at using the on_conflict_do_update fails with Insert does not have an attribute on_conflict_do_update. This is the code fragment:

res = db.execute(
            files.insert()
                .values(
                name=name,
                user_id=user_id,
                parent_name=directory,
                content=content,
            )
            .on_conflict_do_update(constraint="uix_filepath_username", set_={
                "name": name,
                "user_id": user_id,
                "parent_name": directory,
                "content": content
            })
        )

@ssanderson
Copy link
Contributor

ssanderson commented Oct 14, 2021

I was under the assumption that the select and insert/update would happen in a single transaction, because of the db.begin_nested() in the with statement.

This is true. These will both happen in a single transaction, however,

Hence, if two transaction are trying to save (a new notebook) simultaneously, one would win, the other would need to retry a save and get updated in the second attempt.

Having these happen in separate transactions doesn't prevent the race I'm concerned about. Suppose that there are two writers, A and B, that are each trying to write a new notebook to the same location. The following sequence of events is possible:

  • Writer A calls _file_exists, which issues a query to the database checking if the desired file already exists.
  • Concurrently, writer B also calls _file_exists.
  • Writer A's call finishes, reporting that the file doesn't exist.
  • Writer B's call finishes, reporting that the file doesn't exist.
  • Both writers try to write the file. One of them will succeed, and the other will get a unique key violation, which will now be unhandled (this is what we were trying to handle previously by catching IntegrityError.

This is a pretty common race for insert-or-update operations, and the usual way to fix it (without having dedicated "upsert" operations) to try the operation and handle the error if it occurs, which is what this code was trying to do previously. I'm not sure what's going wrong though. The error suggests that we might need to exit the begin_nested block (which corresponds to a SAVEPOINT/ROLLBACK pair), but I think it'd probably be better to use the proper upsert operation now that it exists in pretty much all reasonable versions of postgres.

@ssanderson
Copy link
Contributor

On the upsert: very interesting. My initial attempt at using the on_conflict_do_update fails with Insert does not have an attribute on_conflict_do_update. This is the code fragment:

I think the issue here is that on_conflict_do_update is a postgres-specific operation. Sqlalchemy puts database-specific functionality in special "dialect" modules so that you know you're doing something that might not be portable across databases. Instead of doing files.insert, I think we need to do:

# This creates an `Insert` object that supports postgres-specific functionality.
from sqlalchemy.dialects.postgresql import insert
db.execute(insert(files).values(...).on_conflict_do_update(...)

Using the upsert feature (on_conflict_do_update) of
SQLAlchemy and PostgreSQL, instead of relying on
exception handling of an insert statement.

Upsert statements are support from PostgreSQL 9.5
https://www.postgresql.org/about/press/presskit95/en/
https://www.postgresql.org/docs/current/sql-insert.html
@jorisgillis
Copy link
Author

Thanks for the suggestion. I got the upsert working with SQLAlchemy. What do you think?

@ssanderson
Copy link
Contributor

@jorisgillis these changes look good to me! Unfortunately, it appears that I no longer have permission to merge PRs to this repo. Quantopian, the company for whom I originally wrote this plugin, no longer exists, so unfortunately many of their open source repos are in a bit of a state of flux. I'll see if I can contact someone about getting permissions back on this repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants