Retry file locks after delay in case of failure #28544

Closed · PVince81 wants to merge 1 commit

Conversation

@PVince81 (Contributor)

Description

Adds a decorator to retry acquiring file locks after a delay in case of failure.
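
For context, here is a rough sketch of the shape of such a decorator, using ownCloud's LockedException and the constructor parameters quoted further down in this thread (callAndRetry and changeLock appear in the stack trace below). This is an illustration of the idea, not the exact code in this PR:

<?php
use OCP\Lock\LockedException;

// Sketch only: wraps another locking provider and retries lock calls
// that fail with a LockedException.
class RetryLockingProvider {
    private $provider;
    private $retries;    // number of retries before giving up
    private $retryDelay; // delay between retries, in milliseconds

    public function __construct($provider, $retries = 5, $retryDelay = 1000) {
        $this->provider = $provider;
        $this->retries = $retries;
        $this->retryDelay = $retryDelay;
    }

    public function acquireLock($path, $type) {
        $this->callAndRetry(function () use ($path, $type) {
            $this->provider->acquireLock($path, $type);
        });
    }

    public function changeLock($path, $targetType) {
        $this->callAndRetry(function () use ($path, $targetType) {
            $this->provider->changeLock($path, $targetType);
        });
    }

    // Shared retry loop: re-run the callable until it succeeds or the
    // retry budget is exhausted, sleeping between attempts.
    private function callAndRetry(callable $operation) {
        $attempt = 0;
        while (true) {
            try {
                return $operation();
            } catch (LockedException $e) {
                if (++$attempt > $this->retries) {
                    throw $e; // still locked after all retries
                }
                usleep($this->retryDelay * 1000); // delay is in ms, usleep() takes microseconds
            }
        }
    }

    // releaseLock() / releaseAll() would simply delegate to $this->provider.
}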

Related Issue

Fixes #17016

Motivation and Context

See issue.
Basically, cron tasks or other concurrent tasks might prevent an expensive upload from finishing because of a lock. Instead of giving up immediately, the code retries shortly afterwards to give the transaction a few more chances to finish and avoid having to redo it.

In my personal case, I often upload files with Android right around a quarter-hour mark, which conflicts with the cron run; the upload then often fails with locking issues and I have to retry manually.

How Has This Been Tested?

  • unit tests
  • need manual concurrent testing / smashbox

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@PVince81 (Contributor, Author)

⚠️ needs more concurrent testing!

@PVince81 (Contributor, Author)

If we agree with the extra settings, I'll add them to config.sample.php.

Review comment from a Member on this hunk:

* @param int $retries number of retries before giving up
* @param int $retryDelay delay to wait between retries, in milliseconds
*/
public function __construct($provider, $retries = 5, $retryDelay = 1000) {

Type hinting for the provider
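
Presumably this means type-hinting the wrapped provider against the locking interface. A short sketch of what the suggested signature could look like, assuming the wrapped provider is an OCP\Lock\ILockingProvider (my assumption, not stated in the review comment):

<?php
use OCP\Lock\ILockingProvider;

class RetryLockingProvider {
    private $provider;
    private $retries;
    private $retryDelay;

    // Same constructor as above, but with the provider type-hinted.
    public function __construct(ILockingProvider $provider, $retries = 5, $retryDelay = 1000) {
        $this->provider = $provider;
        $this->retries = $retries;
        $this->retryDelay = $retryDelay;
    }
}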

Review comment from a Member on this hunk:

}

/**
* {@inheritdoc}

indentation

@jvillafanez (Member)

I'm not sure whether it's worth it to parametrize the methods that will be retried in the RetryLockingProvider. If the releaseLock and releaseAll methods won't need to be retried, I guess it's fine as it is now; no need to make things complex.

@PVince81 (Contributor, Author) commented Aug 1, 2017

> I'm not sure whether it's worth it to parametrize the methods that will be retried in the RetryLockingProvider. If the releaseLock and releaseAll methods won't need to be retried, I guess it's fine as it is now; no need to make things complex.

I just wanted to avoid duplicating the code of the while loop logic.

@PVince81 (Contributor, Author) commented Aug 1, 2017

Fixed the indentation.

I've also added the new parameters to config.sample.php.

@jvillafanez (Member)

Ok, code looks good 👍

@PVince81 (Contributor, Author) commented Aug 2, 2017

WTF, I tried to test this and now it seems both processes get a lock...

Here is how I did it:

  1. Disable versions to avoid polluting your storage with versions (would require running cron in a loop): occ app:disable files_versions
  2. In one terminal: X=0; while test $X -eq 0; do curl -D - -u admin:admin -X PUT --data-binary "@data1.dat" http://localhost/owncloud/remote.php/dav/files/admin/x/data1.dat -f; X=$?; echo $X; done
  3. In another terminal, with cadaver, manually and repeatedly: dav:/owncloud/remote.php/webdav/x/> put bacon.txt data1.dat.

Sometimes one of the processes gets 423 Locked, which is fine.
I can see that the cadaver window is sometimes waiting for the lock to free itself.

However, too often I see that both processes get 423 Locked... Maybe I need to randomize the wait time?
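
One common way to do that is to add random jitter to the delay so the two retrying processes don't wake up in lockstep. A small sketch (hypothetical helper, assuming PHP 7's random_int(); not part of this PR):

<?php
// Sleep for the configured delay plus up to 50% random jitter,
// so concurrent retries don't stay synchronized.
function sleepWithJitter(int $delayMs): void {
    $jitter = random_int(0, intdiv($delayMs, 2)); // extra milliseconds, 0..delay/2
    usleep(($delayMs + $jitter) * 1000);          // usleep() takes microseconds
}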

@PVince81 (Contributor, Author) commented Aug 2, 2017

Happens here:

0  OC\Lock\RetryLockingProvider->callAndRetry() /srv/www/htdocs/owncloud/lib/private/Lock/RetryLockingProvider.php:113
1  OC\Lock\RetryLockingProvider->changeLock() /srv/www/htdocs/owncloud/lib/private/Lock/RetryLockingProvider.php:88
2  OC\Files\Storage\Home->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Common.php:671
3  OC\Files\Storage\Wrapper\Encryption->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
4  OC\Files\Storage\Wrapper\Checksum->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
5  OCA\Files_Trashbin\Storage->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
6  OC\Files\View->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/View.php:1945
7  OCA\DAV\Connector\Sabre\File->changeLock() /srv/www/htdocs/owncloud/apps/dav/lib/Connector/Sabre/Node.php:366
8  OCA\DAV\Connector\Sabre\File->put() /srv/www/htdocs/owncloud/apps/dav/lib/Connector/Sabre/File.php:195
9  OCA\DAV\Connector\Sabre\Server->updateFile() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:1129
10 Sabre\DAV\CorePlugin->httpPut() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/CorePlugin.php:513
11 call_user_func_array:{/srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105}() /srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105
12 OCA\DAV\Connector\Sabre\Server->emit() /srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105
13 OCA\DAV\Connector\Sabre\Server->invokeMethod() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:479
14 OCA\DAV\Connector\Sabre\Server->exec() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:254
15 require_once()  /srv/www/htdocs/owncloud/apps/dav/appinfo/v1/webdav.php:63
16 {main}          /srv/www/htdocs/owncloud/remote.php:165

Both processes are trying to get an exclusive lock to finish writing the file (the rename from part file to final file). But the database shows that there is already an exclusive lock there:

MariaDB [owncloud]> select * from oc_file_locks where `lock` > 0;
+----+------+----------------------------------------+------------+
| id | lock | key                                    | ttl        |
+----+------+----------------------------------------+------------+
| 80 |    2 | files/acc49da7b925ff4ba84a9ea48fd2ad26 | 1501676972 |
| 81 |    2 | files/9688085ed76345fbabcce6dc3027772c | 1501676972 |
+----+------+----------------------------------------+------------+

The file in question is "files/9688085ed76345fbabcce6dc3027772c".

It could be related to the fact that we first set shared locks and only then upgrade to an exclusive lock, or to the fact that we also set a lock on all parents. Maybe that causes the two processes to lock each other out.

Needs further research.
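
To spell out the suspected failure mode: if both requests hold a shared lock on the same path and each then tries to upgrade to an exclusive lock, each upgrade is blocked by the other's shared lock, and retrying the upgrade alone can never succeed. An illustrative interleaving using the ILockingProvider lock constants (my reading of the trace, not verified):

// Suspected interleaving of the two concurrent PUTs on the same path:
//
//   A: acquireLock($path, ILockingProvider::LOCK_SHARED);     // OK
//   B: acquireLock($path, ILockingProvider::LOCK_SHARED);     // OK, shared locks coexist
//   A: changeLock($path, ILockingProvider::LOCK_EXCLUSIVE);   // LockedException: B's shared lock
//   B: changeLock($path, ILockingProvider::LOCK_EXCLUSIVE);   // LockedException: A's shared lock
//
// Retrying changeLock() without either side first releasing its shared
// lock repeats the same conflict, however long the retry delay is.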

⚠️ don't merge yet because the above makes this PR rather useless as it will cause an additional delay with no benefit...

@jvillafanez (Member)

Maybe the DB provider doesn't use atomic operations to implement the locks. If the operations aren't atomic, race conditions could happen.
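
For reference, the usual way to make DB-based lock acquisition atomic is to do the state check and the write in a single conditional statement. A generic sketch with a hypothetical file_locks(lock_key, locked) table, not ownCloud's actual DBLockingProvider code:

<?php
// Try to take a lock atomically: the UPDATE only matches rows that are
// currently unlocked, so two concurrent requests cannot both win.
function tryLock(PDO $db, string $key): bool {
    $stmt = $db->prepare(
        'UPDATE file_locks SET locked = 1 WHERE lock_key = :k AND locked = 0'
    );
    $stmt->execute(['k' => $key]);
    return $stmt->rowCount() === 1; // exactly one row changed => lock acquired
}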

@PVince81 (Contributor, Author) commented Aug 2, 2017

Next up: try with Redis to see whether it's the DB causing trouble.

@PVince81 (Contributor, Author)

Looks like setting the delay to 5 seconds (5000 ms) helps to solve problems like #28779.

So it does work.

@PVince81 (Contributor, Author)

This still doesn't explain why, in my earlier test cases, both processes got a lock and excluded each other. It's probably because both first set a shared lock before upgrading to an exclusive lock, and of course when retrying neither of them releases its shared lock first.

@PVince81 (Contributor, Author)

Rebased and increased the default delay value to 5000 ms.

@PVince81 PVince81 modified the milestones: planned, triage Aug 29, 2017
@PVince81 (Contributor, Author)

Turns out that the new DAV endpoint is locking too much, so parallel requests cause lock-outs: #28779

This might explain what I observed here when testing.

It would be best to retest with the old DAV endpoint to validate this PR.

@PVince81 (Contributor, Author)

First test with the old DAV endpoint, same issue: both calls lock each other out.

Back to the drawing board, then...

@PVince81 (Contributor, Author)

Considering that this was not reproducible with Redis, I wonder whether Redis already has a waiting mechanism internally. If so, we'd only need the above logic for DB locking. But it would also mean that advising people to use Redis is the better way to go.

@PVince81 PVince81 modified the milestones: development, triage Nov 3, 2017
@PVince81 (Contributor, Author) commented Feb 7, 2018

Abandoning this. Feel free to take over.

@PVince81 PVince81 closed this Feb 7, 2018
@felixboehm felixboehm removed this from the triage milestone Apr 10, 2018
@PVince81 PVince81 deleted the retrylocking branch September 27, 2018 13:36
@lock lock bot locked as resolved and limited conversation to collaborators Sep 27, 2019