Retry file locks after delay in case of failure #28544

Closed · PVince81 wants to merge 1 commit

Conversation

@PVince81 (Contributor)

Description

Adds a decorator to retry acquiring file locks after a delay in case of failure.
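
For context, here is a rough sketch of the shape of such a decorator, using ownCloud's LockedException and the constructor parameters quoted further down in this thread (callAndRetry and changeLock appear in the stack trace below). This is an illustration of the idea, not the exact code in this PR:

<?php
use OCP\Lock\LockedException;

// Sketch only: wraps another locking provider and retries lock calls
// that fail with a LockedException.
class RetryLockingProvider {
    private $provider;
    private $retries;    // number of retries before giving up
    private $retryDelay; // delay between retries, in milliseconds

    public function __construct($provider, $retries = 5, $retryDelay = 1000) {
        $this->provider = $provider;
        $this->retries = $retries;
        $this->retryDelay = $retryDelay;
    }

    public function acquireLock($path, $type) {
        $this->callAndRetry(function () use ($path, $type) {
            $this->provider->acquireLock($path, $type);
        });
    }

    public function changeLock($path, $targetType) {
        $this->callAndRetry(function () use ($path, $targetType) {
            $this->provider->changeLock($path, $targetType);
        });
    }

    // Shared retry loop: re-run the callable until it succeeds or the
    // retry budget is exhausted, sleeping between attempts.
    private function callAndRetry(callable $operation) {
        $attempt = 0;
        while (true) {
            try {
                return $operation();
            } catch (LockedException $e) {
                if (++$attempt > $this->retries) {
                    throw $e; // still locked after all retries
                }
                usleep($this->retryDelay * 1000); // delay is in ms, usleep() takes microseconds
            }
        }
    }

    // releaseLock() / releaseAll() would simply delegate to $this->provider.
}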

Related Issue

Fixes #17016

Motivation and Context

See issue.
Basically, cron tasks or other concurrent tasks might prevent an expensive upload from finishing because of a lock. Instead of giving up immediately, the code retries shortly afterwards to give the transaction a few more chances to finish and avoid having to redo it.

In my personal case, I often upload files with Android right around a quarter-hour mark, which conflicts with the cron run; the upload then often fails with locking issues and I have to retry manually.

How Has This Been Tested?

  • unit tests
  • need manual concurrent testing / smashbox

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@PVince81 (Contributor, Author)

⚠️ needs more concurrent testing!

@PVince81 (Contributor, Author)

If we agree with the extra settings, I'll add them to config.sample.php.

Review comment from a Member on this hunk:

* @param int $retries number of retries before giving up
* @param int $retryDelay delay to wait between retries, in milliseconds
*/
public function __construct($provider, $retries = 5, $retryDelay = 1000) {

Type hinting for the provider
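
Presumably this means type-hinting the wrapped provider against the locking interface. A short sketch of what the suggested signature could look like, assuming the wrapped provider is an OCP\Lock\ILockingProvider (my assumption, not stated in the review comment):

<?php
use OCP\Lock\ILockingProvider;

class RetryLockingProvider {
    private $provider;
    private $retries;
    private $retryDelay;

    // Same constructor as above, but with the provider type-hinted.
    public function __construct(ILockingProvider $provider, $retries = 5, $retryDelay = 1000) {
        $this->provider = $provider;
        $this->retries = $retries;
        $this->retryDelay = $retryDelay;
    }
}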

Review comment from a Member on this hunk:

}

/**
* {@inheritdoc}

indentation

@jvillafanez (Member)

I'm not sure whether it's worth it to parametrize the methods that will be retried in the RetryLockingProvider. If the releaseLock and releaseAll methods won't need to be retried, I guess it's fine as it is now; no need to make things complex.

@PVince81 (Contributor, Author) commented Aug 1, 2017

> I'm not sure whether it's worth it to parametrize the methods that will be retried in the RetryLockingProvider. If the releaseLock and releaseAll methods won't need to be retried, I guess it's fine as it is now; no need to make things complex.

I just wanted to avoid duplicating the code of the while loop logic.

@PVince81 (Contributor, Author) commented Aug 1, 2017

Fixed the indentation.

I've also added the new parameters to config.sample.php.

@jvillafanez (Member)

Ok, code looks good 👍

@PVince81 (Contributor, Author) commented Aug 2, 2017

WTF, I tried to test this and now it seems both processes get a lock...

Here is how I did it:

  1. Disable versions to avoid polluting your storage with versions (would require running cron in a loop): occ app:disable files_versions
  2. In one terminal: X=0; while test $X -eq 0; do curl -D - -u admin:admin -X PUT --data-binary "@data1.dat" http://localhost/owncloud/remote.php/dav/files/admin/x/data1.dat -f; X=$?; echo $X; done
  3. In another terminal, with cadaver, manually and repeatedly: dav:/owncloud/remote.php/webdav/x/> put bacon.txt data1.dat.

Sometimes one of the processes gets 423 Locked, which is fine.
I can see that the cadaver window is sometimes waiting for the lock to free itself.

However, too often I see that both processes get 423 Locked... Maybe I need to randomize the wait time?
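
One common way to do that is to add random jitter to the delay so the two retrying processes don't wake up in lockstep. A small sketch (hypothetical helper, assuming PHP 7's random_int(); not part of this PR):

<?php
// Sleep for the configured delay plus up to 50% random jitter,
// so concurrent retries don't stay synchronized.
function sleepWithJitter(int $delayMs): void {
    $jitter = random_int(0, intdiv($delayMs, 2)); // extra milliseconds, 0..delay/2
    usleep(($delayMs + $jitter) * 1000);          // usleep() takes microseconds
}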

@PVince81 (Contributor, Author) commented Aug 2, 2017

Happens here:

0  OC\Lock\RetryLockingProvider->callAndRetry() /srv/www/htdocs/owncloud/lib/private/Lock/RetryLockingProvider.php:113
1  OC\Lock\RetryLockingProvider->changeLock() /srv/www/htdocs/owncloud/lib/private/Lock/RetryLockingProvider.php:88
2  OC\Files\Storage\Home->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Common.php:671
3  OC\Files\Storage\Wrapper\Encryption->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
4  OC\Files\Storage\Wrapper\Checksum->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
5  OCA\Files_Trashbin\Storage->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/Storage/Wrapper/Wrapper.php:613
6  OC\Files\View->changeLock() /srv/www/htdocs/owncloud/lib/private/Files/View.php:1945
7  OCA\DAV\Connector\Sabre\File->changeLock() /srv/www/htdocs/owncloud/apps/dav/lib/Connector/Sabre/Node.php:366
8  OCA\DAV\Connector\Sabre\File->put() /srv/www/htdocs/owncloud/apps/dav/lib/Connector/Sabre/File.php:195
9  OCA\DAV\Connector\Sabre\Server->updateFile() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:1129
10 Sabre\DAV\CorePlugin->httpPut() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/CorePlugin.php:513
11 call_user_func_array:{/srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105}() /srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105
12 OCA\DAV\Connector\Sabre\Server->emit() /srv/www/htdocs/owncloud/lib/composer/sabre/event/lib/EventEmitterTrait.php:105
13 OCA\DAV\Connector\Sabre\Server->invokeMethod() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:479
14 OCA\DAV\Connector\Sabre\Server->exec() /srv/www/htdocs/owncloud/lib/composer/sabre/dav/lib/DAV/Server.php:254
15 require_once()  /srv/www/htdocs/owncloud/apps/dav/appinfo/v1/webdav.php:63
16 {main}          /srv/www/htdocs/owncloud/remote.php:165

Both processes are trying to get an exclusive lock to finish writing the file (the rename from part file to final file). But the database shows that there is already an exclusive lock there:

MariaDB [owncloud]> select * from oc_file_locks where `lock` > 0;
+----+------+----------------------------------------+------------+
| id | lock | key                                    | ttl        |
+----+------+----------------------------------------+------------+
| 80 |    2 | files/acc49da7b925ff4ba84a9ea48fd2ad26 | 1501676972 |
| 81 |    2 | files/9688085ed76345fbabcce6dc3027772c | 1501676972 |
+----+------+----------------------------------------+------------+

The file in question is "files/9688085ed76345fbabcce6dc3027772c".

It could be related to the fact that we first set shared locks and only then upgrade to an exclusive lock, or to the fact that we also set a lock on all parents. Maybe that causes the two processes to lock each other out.

Needs further research.
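
To spell out the suspected failure mode: if both requests hold a shared lock on the same path and each then tries to upgrade to an exclusive lock, each upgrade is blocked by the other's shared lock, and retrying the upgrade alone can never succeed. An illustrative interleaving using the ILockingProvider lock constants (my reading of the trace, not verified):

// Suspected interleaving of the two concurrent PUTs on the same path:
//
//   A: acquireLock($path, ILockingProvider::LOCK_SHARED);     // OK
//   B: acquireLock($path, ILockingProvider::LOCK_SHARED);     // OK, shared locks coexist
//   A: changeLock($path, ILockingProvider::LOCK_EXCLUSIVE);   // LockedException: B's shared lock
//   B: changeLock($path, ILockingProvider::LOCK_EXCLUSIVE);   // LockedException: A's shared lock
//
// Retrying changeLock() without either side first releasing its shared
// lock repeats the same conflict, however long the retry delay is.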

⚠️ don't merge yet because the above makes this PR rather useless as it will cause an additional delay with no benefit...

@jvillafanez (Member)

Maybe the DB provider doesn't use atomic operations to implement the locks. If the operations aren't atomic, race conditions could happen.
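
For reference, the usual way to make DB-based lock acquisition atomic is to do the state check and the write in a single conditional statement. A generic sketch with a hypothetical file_locks(lock_key, locked) table, not ownCloud's actual DBLockingProvider code:

<?php
// Try to take a lock atomically: the UPDATE only matches rows that are
// currently unlocked, so two concurrent requests cannot both win.
function tryLock(PDO $db, string $key): bool {
    $stmt = $db->prepare(
        'UPDATE file_locks SET locked = 1 WHERE lock_key = :k AND locked = 0'
    );
    $stmt->execute(['k' => $key]);
    return $stmt->rowCount() === 1; // exactly one row changed => lock acquired
}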

@PVince81 (Contributor, Author) commented Aug 2, 2017

Next up: try with Redis to see whether it's the DB causing trouble.

@PVince81 (Contributor, Author)

Looks like setting the delay to 5 seconds (5000 ms) helps to solve problems like #28779.

So it does work.

@PVince81 (Contributor, Author)

This still doesn't explain why, in my earlier test cases, both processes got a lock and excluded each other. It's probably because both first set a shared lock before upgrading to an exclusive lock, and of course when retrying neither of them releases its shared lock first.

@PVince81 (Contributor, Author)

Rebased and increased the default delay value to 5000 ms.

@PVince81 PVince81 modified the milestones: planned, triage Aug 29, 2017
@PVince81 (Contributor, Author)

Turns out that the new DAV endpoint is locking too much, so parallel requests cause lock-outs: #28779

This might explain what I observed here when testing.

It would be best to retest with the old DAV endpoint to validate this PR.

@PVince81 (Contributor, Author)

First test with the old DAV endpoint, same issue: both calls lock each other out.

Back to the drawing board, then...

@PVince81 (Contributor, Author)

Considering that this was not reproducible with Redis, I wonder whether Redis already has a waiting mechanism internally. If so, we'd only need the above logic for DB locking. But it would also mean that advising people to use Redis is the better way to go.

@PVince81 PVince81 modified the milestones: development, triage Nov 3, 2017
@PVince81 (Contributor, Author) commented Feb 7, 2018

Abandoning this. Feel free to take over.

@PVince81 PVince81 closed this Feb 7, 2018
@felixboehm felixboehm removed this from the triage milestone Apr 10, 2018
@PVince81 PVince81 deleted the retrylocking branch September 27, 2018 13:36
@lock lock bot locked as resolved and limited conversation to collaborators Sep 27, 2019