
Fix double execution after concurrent node bootstraps #380

Closed

Conversation

Collaborator

@sylwiaszunejko sylwiaszunejko commented Oct 1, 2024

In some cases we want to update the pool only if the previous one does not exist or is shut down. This commit adds additional validation to add_or_renew_pool to make sure this condition is met.

Problematic scenario:

We boot two Scylla nodes concurrently into an existing cluster.
The Python driver receives two on_add notifications, one for each node.
Each notification calls add_or_renew_pool, which creates a connection pool to the corresponding node.

But then, for some reason, one of the on_adds (let's say it is the one for node 1) may cause another add_or_renew_pool to be called for the other server (let's call it node 2). This happens via _finalize_add -> update_created_pools.

The reason add_or_renew_pool was executed a second time is this check:

if not pool or pool.is_shutdown:
    # we don't eagerly set is_up on previously ignored hosts. None is included here
    # to allow us to attempt connections to hosts that have gone from ignored to something
    # else.
    if distance != HostDistance.IGNORED and host.is_up in (True, None):
        future = self.add_or_renew_pool(host, False)
At this point the pool for node 2 is None. That is because add_or_renew_pool was called but has not finished yet; specifically, it has not reached
    self._pools[host] = new_pool
so the new pool was created but not yet assigned to self._pools[host]. This is why adding a sleep before that line makes the issue easier to reproduce (022a3aa).

This may cause a second pool to be created for the other server and the initially established pool to that server to be closed.
There could be a statement running on the initially established pool. The statement may have already been executed on the Scylla side, but the driver has not received a response yet.
The pool is closed before the response arrives. This causes the driver to retry the statement on the new pool, leading to double execution.

So in that specific case we only want to create a new pool if there was no pool before or the previous one was shut down. But it may be that the previous pool was created but not assigned yet, so before assigning the second one we want to check whether the previous one meets these conditions.
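A rough, self-contained sketch of the intent behind the change (the SketchPool class, the module-level pools dict and the exact placement of the check are simplifications for illustration, not the actual diff):

import threading

# Sketch of the intent: only replace the existing pool if it is missing or
# already shut down; otherwise discard the pool that was just opened.

class SketchPool:
    """Hypothetical stand-in for a host connection pool, only for this illustration."""
    def __init__(self, host):
        self.host = host
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True

pools = {}
pools_lock = threading.Lock()

def add_or_renew_pool(host, require_previous_shutdown_or_none):
    new_pool = SketchPool(host)              # in the driver, connections are opened here
    with pools_lock:
        previous = pools.get(host)
        if require_previous_shutdown_or_none and previous is not None \
                and not previous.is_shutdown:
            # A usable pool already exists (e.g. it was assigned while we were
            # opening connections), so keep it and drop the new one.
            new_pool.shutdown()
            return previous
        pools[host] = new_pool
    if previous is not None:
        previous.shutdown()
    return new_pool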

Fixes: #317

@sylwiaszunejko
Collaborator Author

@dkropachev @Lorak-mmk could you take a look?

@Lorak-mmk

Could you explain more? I'm not that familiar with this code - and in general you should not expect reviewers to be as familiar with a piece of code as you are.
In the issue Kamil explained how the bug happens - please describe how this scenario is executed after your change and why your change stops the bug.

Some more specific questions regarding the code:

  • You said:

We want to update the pool only if the previous one does not exist or is shut down.

But in the code I see that require_previous_shutdown_or_none=False in 3 out of 4 calls - why?

  • Change in add_or_renew_pool: do I see correctly that first you create a new pool and only then check whether it should really be created (and shut it down if not)?
    This looks like a waste of resources - why open those connections at all? Is there a reason for that?

@sylwiaszunejko
Collaborator Author

@Lorak-mmk I added some extra details to the description

@sylwiaszunejko sylwiaszunejko force-pushed the fix_double_execution branch 3 times, most recently from 2d9bbfe to be1debf on October 8, 2024 06:28
In some cases we want to update the pool only if the previous
one does not exist or is shut down. This commit adds additional
validation to add_or_renew_pool to make sure this condition
is met when needed.

Fixes: scylladb#317
@sylwiaszunejko
Collaborator Author

Rebased on master and reworded the commit message

@Lorak-mmk

Lorak-mmk commented Oct 8, 2024

Please tell me if I understand everything correctly.
add_or_renew_pool is an "async" function - it checks whether the host is ignored (by what? The LBP? I don't remember what _profile_manager is...), and if not, it schedules the actual function, run_add_or_renew_pool, to be executed on an internal thread pool (which may have more than one thread, so functions executed on it still need to care about concurrency).

run_add_or_renew_pool, as the name suggests, creates or recreates (if one already exists) a connection pool to the given host, puts it into Session._pools, and shuts down the old pool.
What this function does:

  1. It tries to create a new connection pool to the given host. Interestingly, there are different paths for protocols >= 3 (HostConnection) and < 3 (HostConnectionPool) - perhaps we could get rid of the second one, as we don't use protocol 1 or 2?
  2. It retrieves the current pool open to the given host.
  3. It takes a Session-scoped lock.
  4. It enters a loop that makes sure the keyspace of the new pool is the same as the keyspace of the session: while new_pool._keyspace != self.keyspace:. Question: why is this needed? From what I can see, both HostConnection.__init__ and HostConnectionPool.__init__ already set the keyspace. Is it possible for them to somehow fail at that? Or is it possible for the keyspace on the Session object to change between pool creation and this point?
  5. Inside the loop (so if the keyspace of the pool differs from the Session's) it releases the lock, tries to change the keyspace of the new pool, and re-acquires the lock.
  6. It puts the new pool in Session._pools and releases the Session lock.
  7. It closes the previous pool (fetched in step 2) if it exists. (A condensed, runnable toy model of this flow follows the list.)
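Condensing the above into code (FakePool and FakeSession are stand-ins I made up for this illustration, not driver classes, and error handling is omitted):

import threading

class FakePool:
    """Stand-in for HostConnection, just enough state for the flow below."""
    def __init__(self, host, keyspace):
        self.host = host
        self._keyspace = keyspace
        self.is_shutdown = False
    def set_keyspace_blocking(self, keyspace):
        self._keyspace = keyspace
    def shutdown(self):
        self.is_shutdown = True

class FakeSession:
    """Toy model of steps 1-7 above, not the real Session."""
    def __init__(self, keyspace=None):
        self.keyspace = keyspace
        self._lock = threading.Lock()
        self._pools = {}

    def run_add_or_renew_pool(self, host):
        new_pool = FakePool(host, self.keyspace)      # step 1: open connections
        previous = self._pools.get(host)              # step 2: note - outside the lock
        self._lock.acquire()                          # step 3
        while new_pool._keyspace != self.keyspace:    # step 4
            self._lock.release()                      # step 5: the lock is dropped here
            new_pool.set_keyspace_blocking(self.keyspace)
            self._lock.acquire()
        self._pools[host] = new_pool                  # step 6
        self._lock.release()
        if previous:
            previous.shutdown()                       # step 7
        return True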

Given the description of run_add_or_renew_pool that I wrote above the list of steps (and assuming it is correct), I see some problems with this function.

As I stated, it does need to care about concurrency - and it probably does try, judging by the existence of a lock. There are, however, scenarios where it fails. Consider two concurrent calls (C1, C2) with the same host argument:

  1. C1 creates a new pool, retrieves the previous one, and acquires the lock. If the keyspace of the newly created pool is different from the one on the Session, it releases the lock and tries to fix it. During this time C2 can create a pool, fetch the previous one, and update Session._pools. Now C1 finishes fixing the keyspace, updates Session._pools, and closes its own previous pool - which is different from the pool created by C2, so C2's pool will never be shut down.
  2. Even if we assume that the keyspace of a new pool is never different from the Session's, the problem still exists: C1 and C2 create new pools, both fetch the same previous pool (because fetching it is not protected by the lock), and then, under the lock, update Session._pools and shut down the previous pool: again, one of the new pools will never be shut down (see the sketch after this list).
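Forcing the interleaving from point 2 by hand, with the FakePool/FakeSession toy above (each call's steps written out explicitly instead of using real threads):

# Both C1 and C2 reach step 2 before either reaches step 6.
s = FakeSession(keyspace="ks")
s._pools["node2"] = old = FakePool("node2", "ks")

c1_new, c1_prev = FakePool("node2", "ks"), s._pools.get("node2")   # C1: steps 1-2
c2_new, c2_prev = FakePool("node2", "ks"), s._pools.get("node2")   # C2: steps 1-2

with s._lock:
    s._pools["node2"] = c1_new      # C1: step 6
c1_prev.shutdown()                  # C1: step 7, closes 'old'

with s._lock:
    s._pools["node2"] = c2_new      # C2: step 6, replaces c1_new
c2_prev.shutdown()                  # C2: step 7, also closes 'old'

# c1_new was replaced in Session._pools but never shut down - it leaks open connections.
assert s._pools["node2"] is c2_new and not c1_new.is_shutdown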

I think that we should try to make this function safe (which means properly handling concurrency) - I don't see how we can be sure that anything works when such footguns are present.

Now regarding the issue at hand, you wrote:

That is because add_or_renew_pool was called but has not finished yet (...)

from what I can see in Cluster.on_add, _finalize_add is called only after run_add_or_renew_pool has finished its work. If that is the case, then how could we enter the if not pool or pool.is_shutdown: branch in update_created_pools?
Either _finalize_add (and thus update_created_pools) is called before run_add_or_renew_pool finishes its work - which is a bug and needs a different fix than this PR -
or the scenario is different and I'm misunderstanding something - in that case please explain more, because I can't approve what I don't understand.

Okay, I got it after writing the above paragraph. run_add_or_renew_pool from the first on_add finishes and calls update_created_pools, but the pool from the second on_add is not yet ready.

So the underlying issue seems to be that the code assumes that on_add is called rarely enough that all the handling can be pretty much sequential. This is no longer the case, so the assumptions are violated.
In general, I doubt we can fix this simply. We can put a band-aid on one race condition, but surely others will appear - because the underlying problem is that pool/metadata management in this driver is a horrible mess of thread pools, callbacks and random locks sprinkled in some places, which no human can properly understand.
IMO we should move to the model used by the Rust Driver (I think the Java Driver does something very similar): there should be a thread (or asyncio task) that manages the cluster and the pools, and the rest of the driver communicates with it using messages. Such a task doesn't have to worry about concurrency, because it is the only owner of the data. Apart from that, each connection pool can have its own task that manages the pool.

This model is easy to understand for a developer reading the code, and it eliminates most of the problems we see here in the Python Driver.
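A minimal sketch of what I mean, using asyncio (my own illustration, not the Rust or Java driver API): a single task owns the pool map, and everything else talks to it through a queue, so no locks are needed around pool state.

import asyncio

async def pool_manager(commands: asyncio.Queue):
    pools = {}                                   # owned exclusively by this task
    while True:
        op, host, reply = await commands.get()
        if op == "add_or_renew":
            previous = pools.get(host)
            if previous is None or previous["shutdown"]:
                pools[host] = {"host": host, "shutdown": False}
            reply.set_result(pools[host])
        elif op == "remove":
            pool = pools.pop(host, None)
            if pool is not None:
                pool["shutdown"] = True
            reply.set_result(None)

async def main():
    commands: asyncio.Queue = asyncio.Queue()
    manager = asyncio.create_task(pool_manager(commands))
    # Two concurrent "on_add" notifications for the same host are serialized
    # by the queue, so only one pool is ever created for it.
    loop = asyncio.get_running_loop()
    replies = [loop.create_future() for _ in range(2)]
    for reply in replies:
        await commands.put(("add_or_renew", "node2", reply))
    first, second = await asyncio.gather(*replies)
    assert first is second
    manager.cancel()

asyncio.run(main())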

Now I assume we can't wait with fixing the linked issue until we can properly fix pool management as I wrote above - so let's move to your patch.

I don't see how this flag fixes anything. There is already a check whether the previous pool is None or shut down, in update_created_pools - see the first line of this snippet:

if not pool or pool.is_shutdown:
    # we don't eagerly set is_up on previously ignored hosts. None is included here
    # to allow us to attempt connections to hosts that have gone from ignored to something
    # else.
    if distance != HostDistance.IGNORED and host.is_up in (True, None):
        future = self.add_or_renew_pool(host, False)

You are performing the exact same check, just a bit later (after you have created a new pool) - it doesn't fix anything (because there is no guarantee that the assignment to Session._pools has already happened at this point), but it reduces the chances of the race happening (moving the check to after the new pool is created is basically adding a sleep, which only makes the race less likely).
Consider the following scenario:

  • C1 is in run_add_or_renew_pool, but has not assigned to Session._pools yet.
  • C2 is in update_created_pools. It calls add_or_renew_pool, which creates a new pool and assigns previous = self._pools.get(host).
  • Let's assume that we never enter the while new_pool._keyspace != self.keyspace: loop, because it only makes the analysis harder.
  • Now it doesn't matter whether C1 or C2 takes the lock and assigns to Session._pools first: for both of them previous is None, so they will both assign to Session._pools, regardless of the value of your flag.
  • Note: one of those pools will never be shut down.
  • The bug still exists.

I think that if you change the reproducer so that the sleep before assigning to self._pools only happens when your flag is false, it may reproduce again - but I'm not completely sure. It may require more changes.

@sylwiaszunejko
Collaborator Author

sylwiaszunejko commented Oct 9, 2024

@Lorak-mmk You are right, I agree it only reduces the chances of a failure. Thanks for analyzing this carefully.

@sylwiaszunejko
Collaborator Author

Do we have an issue for fixing pool management?

@Lorak-mmk

Do we have an issue for fixing pool management?

I don't think we do.

@sylwiaszunejko
Collaborator Author

@Lorak-mmk @dkropachev To be honest, I don't have any idea how to solve the issue differently. Do you have any idea for the fix? If not, since this solution has been shown to be incomplete, I will close this PR. Additionally, I will create an issue for fixing pool management, and as the original issue is not that urgent (#317 (comment)) I think its priority should be lowered to P2 and maybe it can wait for a proper fix.

@Lorak-mmk

One possible idea:

  1. Check why the code in run_add_or_renew_pool checks for keyspace inequality. If it's a leftover and cannot actually happen, we can remove it. If it can happen, think about whether the code can be changed to omit this check. If we remove this code, then there is no lock releasing and re-acquiring.
  2. Check why previous = self._pools.get(host) is not done under the lock and whether it can be moved inside the lock.
  3. If you succeed with 1 and 2, then we have a single atomic step that fetches the previous pool and replaces it (and then shuts down the previous pool) - which is a rough fix for the issue (see the sketch below).
  4. Add some mechanism to avoid creating additional pools that waste resources - we can discuss how to do this if you succeed with 1, 2 and 3.

If 1 can't be solved but 2 can, then perhaps we can re-fetch previous after acquiring the lock, to keep it up to date - that should be OK too, but I'm not sure.
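A rough sketch of what points 1-3 would give us (assuming the keyspace loop can indeed be dropped; these names are placeholders of mine, not the driver's):

import threading

_pools = {}
_lock = threading.Lock()

def swap_pool(host, new_pool):
    """Fetch the previous pool and publish the new one in a single atomic step."""
    with _lock:
        previous = _pools.get(host)     # now read under the same lock ...
        _pools[host] = new_pool         # ... that protects the assignment
    if previous is not None:
        previous.shutdown()             # no longer reachable via _pools at this point
    return previous

With this, a racing caller's pool would at least end up shut down instead of leaked, although it was still opened for nothing - which is what point 4 is about.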

@Lorak-mmk

Hmm... I'm not sure if what I wrote makes sense, I have to think about it more.

@Lorak-mmk

Ok, so points 1 and 2 are good in general, but they are not enough to fix the issue - two pools would still be opened, and two assignments would happen.
We could additionally have some dict that indicates whether such an operation is already in progress (so basically my point 4). I think it should work and prevent this scenario, but I'm not 100% sure - and this is still a band-aid for this specific scenario, not a proper fix.
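One possible shape of that guard (my sketch; I use a set rather than a dict, and all names here are made up):

import threading

_lock = threading.Lock()
_pools = {}
_renew_in_progress = set()

def add_or_renew_pool(host, open_pool):
    """Skip the whole operation if a renew for this host is already running."""
    with _lock:
        if host in _renew_in_progress:
            return None                  # someone else is already handling this host
        _renew_in_progress.add(host)
    try:
        new_pool = open_pool(host)       # open connections outside the lock
        with _lock:
            previous = _pools.get(host)
            _pools[host] = new_pool
        if previous is not None:
            previous.shutdown()
        return new_pool
    finally:
        with _lock:
            _renew_in_progress.discard(host)

That way the second, concurrently triggered call never opens a second pool for the host in the first place.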

@dkropachev dkropachev requested review from dkropachev and removed request for dkropachev October 9, 2024 09:03
@sylwiaszunejko
Collaborator Author

I am closing this PR as it is not a correct solution; the issue will need a separate fix, probably utilizing some of the ideas above.
