Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize PROPFIND for non-shared/shared by reducing the number of que… #27284

Merged
merged 1 commit into from
Mar 10, 2017

Conversation

mrow4a
Copy link
Contributor

@mrow4a mrow4a commented Mar 1, 2017

Hello guys, after exams I am back at work with a lot of new energy, knowledge and ideas. Lets start the fun with this PR.

The idea here is behavior of SharedPlugin, which I found out reviewing our code with BlackFire e.g:
This is example query if you have 100 NON-SHARED files, it put my attention on quering sharing table while we dont have shares there, and why dont we just 1 query or so..
selection_019

I have been interested in what causes that amount of share_table accesses and started optimizing the code:

@mention-bot
Copy link

@mrow4a, thanks for your PR! By analyzing the history of the files in this pull request, we identified @PVince81, @DeepDiver1975 and @phisch to be potential reviewers.

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 1, 2017

Results for the PROPFIND query with 1st optimization:

Case 1
-17 queries
selection_020

selection_021

Case 2
-300 queries
100 NON-SHARED files, just someone uploaded 100 files to the folder
selection_022

@DeepDiver1975 DeepDiver1975 added this to the 10.0 milestone Mar 1, 2017
* @param Node $path
* @param bool $reshares
* @return IShare[]
* @since 9.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

10.0.0

Copy link
Contributor

@PVince81 PVince81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice step forward, this is going in the right direction I think.

Next step is to take an array of nodes so you can do where file_source in (...) (with array_chunk of 1000 blocks)

@@ -564,6 +564,16 @@ public function deleteFromSelf(IShare $share, $recipient) {
/**
* @inheritdoc
*/
public function getAllSharesBy($userId, $shareTypes, $node, $reshares) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's directly add $limit and $offset to avoid having to change the interface again in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is not interested in any offsets, as name states, ALL :>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ALL meaning "the whole result set contains all the shares of that user" but it doesn't mean we shouldn't provide a way to paginate over said result set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the same semantics as if we say "get all users" from an install that has 10000 users, we need a way to paginate as we don't want to return 10000 entries as it would be slow, especially for the UI parts

$allShares = $this->shareManager->getAllSharesBy(
$this->userId,
$requestedShareTypes,
$node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

next step is taking an array of nodes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@DeepDiver1975
Copy link
Member

💥

OCA\DAV\Tests\unit\Connector\Sabre\SharesPluginTest::testGetProperties with data set #0 (array())
Invalid argument supplied for foreach()

/var/lib/jenkins/workspace/owncloud-core_core_PR-27284-5KMQXE2IARTR6EZOXEANYYNTQ5MB25RWDD7UWVSPM56ZJNO46SIA/apps/dav/lib/Connector/Sabre/SharesPlugin.php:128
/var/lib/jenkins/workspace/owncloud-core_core_PR-27284-5KMQXE2IARTR6EZOXEANYYNTQ5MB25RWDD7UWVSPM56ZJNO46SIA/apps/dav/lib/Connector/Sabre/SharesPlugin.php:171
/var/lib/jenkins/workspace/owncloud-core_core_PR-27284-5KMQXE2IARTR6EZOXEANYYNTQ5MB25RWDD7UWVSPM56ZJNO46SIA/lib/composer/sabre/dav/lib/DAV/PropFind.php:98
/var/lib/jenkins/workspace/owncloud-core_core_PR-27284-5KMQXE2IARTR6EZOXEANYYNTQ5MB25RWDD7UWVSPM56ZJNO46SIA/apps/dav/lib/Connector/Sabre/SharesPlugin.php:175
/var/lib/jenkins/workspace/owncloud-core_core_PR-27284-5KMQXE2IARTR6EZOXEANYYNTQ5MB25RWDD7UWVSPM56ZJNO46SIA/apps/dav/tests/unit/Connector/Sabre/SharesPluginTest.php:125

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 1, 2017

Thomas, I just been doing prove of concept, as discussed I will now adjust unit tests :> Wanted to see how much we can get

foreach ($allShares as $share) {
$shareType = $share->getShareType();
if (in_array($shareType, $requestedShareTypes)) {
$shareTypes[] = $shareType;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PVince81 we need uniqueness here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we do

Copy link
Contributor Author

@mrow4a mrow4a Mar 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we did with @PVince81 using hashmap and array_keys to do it efficiently will push together with unit tests

@DeepDiver1975
Copy link
Member

@SergioBertolinSG @PVince81 do we have behat tests for this? if not we need to add them BEFORE this change - THX

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 3, 2017

For already obtained nodes through $folderNode->getDirectoryListing (do not replicate queries which already had to be done for other reasons), query all share-types using IN() clause, injecting nodeID predicates directly into the query. Results for the PROPFIND query with 2nd optimization:

Case 1
10 files, having all possible cases for sharing e.g. (folder share, file shares link, link+.., link+group+2xuser+feder)

reduced by 104 -> 59 queries and -15% shorter execution time

selection_030

Case 2
100 NON-SHARED files, just someone uploaded 100 files to the folder
reduced from 745 -> 244 queries and -48% shorter execution time
selection_031

If all checks pass, please switch to review @DeepDiver1975 @PVince81. Please also check with the customer to "tick" the last check from 1st post

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 3, 2017

@DeepDiver1975 Yes, please test it, later I will write more unit tests to check more cases e.g IN() chunking .

@SergioBertolinSG
Copy link
Contributor

@SergioBertolinSG @PVince81 do we have behat tests for this? if not we need to add them BEFORE this change - THX

What are the approximate steps here?

@PVince81
Copy link
Contributor

PVince81 commented Mar 6, 2017

Steps:

  1. Login as user user1
  2. Share with user/group/link
  3. PROPFIND with the "oc:share-types" property, verify that the value matches the share types

The share types are unique, so if the user had two outgoing user shares, the type "user" only appears once. This field is used for the web UI to display the correct share status icon in each file row.

Copy link
Contributor

@PVince81 PVince81 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress!

@@ -104,34 +104,31 @@ public function initialize(\Sabre\DAV\Server $server) {
}

/**
* Return a list of share types for outgoing shares
* Update cachedShareTypes for specific nodeIDs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the method is called get not update.

I suggest you keep it to be a get method and transparently handle the cache. The caller doesn't need to know about the caching.

Basically a construct like:

function getSomething($key) {
    if (!isset($this->cacheSomething[$key])) {
        $this->cacheSomething[$key] = $this->getTheExpensiveThing($key);
    }
    return $this->cacheSomething[$key];
}

This is what we're doing in a lot of the OC code so let's keep this pattern.

foreach ($allShares as $share) {
$currentNodeID = $share->getNodeId();
$currentShareType = $share->getShareType();
$this->cachedShareTypes[$currentNodeID][$currentShareType] = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need caching here ? Is this function called repeatedly ?


// Put node ID into an array and initialize cache for it
$nodeId = intval($childNode->getId());
array_push($nodeIdsArray, $nodeId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use $nodeIdsArray[] = $nodeId;. not sure if this makes a difference performance-wise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I change it? isnt it equivalent?


// Put node ID into an array and initialize cache for it
$nodeId = intval($childNode->getId());
array_push($nodeIdsArray, $nodeId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we usually use $nodeIdsArray[] = $nodeId;. not sure if this makes a difference performance-wise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I change it? isnt it equivalent?

@@ -90,6 +90,18 @@ public function moveShare(IShare $share, $recipientId);
* Get shares shared by (initiated) by the provided user.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjust to say Get ALL shares

Copy link
Contributor Author

@mrow4a mrow4a Mar 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}

// Cache share-types obtaining them from DB
$this->getNodesShareTypes($nodeIdsArray);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A get function that doesn't return anything. this will likely confuse other developers. Please make that function return something.

$nodeId = $this->userFolder->get($sabreNode->getPath())->getId();
$this->cachedShareTypes[$nodeId] = [];
$this->getNodesShareTypes([$nodeId]);
$shareTypesHash = $this->cachedShareTypes[$nodeId];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't use the cache directly here and rely on the function doing that for you.

$shareTypesHash = $this->getNodesShareTypes([$nodeId]);

);
}

$qb->andWhere($qb->expr()->in('file_source', $qb->createParameter('file_source_ids')));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will need chunking here with array_chunk (because some databases have a max limit for their IN operator...). See how it's done here: https://github.com/owncloud/core/blob/v9.1.4/lib/private/Share20/DefaultShareProvider.php#L930

Copy link
Contributor Author

@mrow4a mrow4a Mar 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did chunking in different part of the code, in Manager, should I change it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if another part of code written by a future dev calls getAllSharesBy ? It is better if chunking is done inside this to avoid every API consumer to have to think of doing chunking (which is likely to be duplicate code, if not forgotten)

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 6, 2017

@SergioBertolinSG I will today give you a full script to test it, initializing users/shares and profiling this etc.

@SergioBertolinSG
Copy link
Contributor

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 6, 2017

Ok Vincent I will add the code style you require and let's change label to review so they can test it

@PVince81
Copy link
Contributor

PVince81 commented Mar 6, 2017

@SergioBertolinSG thanks for the info, that's perfect.

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 6, 2017

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 6, 2017

@PVince81 please review fixes and decide about changing to Review and get it tested on real sized DB? @SergioBertolinSG @SamuAlfageme

@DeepDiver1975 is everything ok with Jenkins?

@PVince81
Copy link
Contributor

PVince81 commented Mar 6, 2017

rebase onto master to get rid of that JS failure

@PVince81
Copy link
Contributor

PVince81 commented Mar 7, 2017

now the test failures are related to your changes

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 7, 2017

We should probably also smashbox this PR against all tests, especially share related.

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 7, 2017

@DeepDiver1975 @PVince81 Please change flag to review to indicate the PR is ready

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 7, 2017

@DeepDiver1975 @PVince81 Ok, flag changed, I just need to fullfil last tick:

  • integration tests (also smashbox)
  • benchmark it on bigger instance @felixboehm

Copy link
Member

@DeepDiver1975 DeepDiver1975 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not yet sure if all changes to the unit tests are valid .... I need to have a closer look

@@ -529,9 +529,20 @@ public function testGetAllSharedByWithReshares() {
->setNode($node);
$this->provider->create($share2);

for($i = 0; $i < 200; $i++) {
$receiver = strval($i)."[email protected]";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it relevant? It is just test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well - as a student you are allowed to learn even from tests 😉

@@ -529,9 +529,20 @@ public function testGetAllSharedByWithReshares() {
->setNode($node);
$this->provider->create($share2);

for($i = 0; $i < 200; $i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you create 200 shared here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test chunking of Node IDs.


$orX = $qb->expr()->orX();
$nodeIdsChunks = array_chunk($nodeIDs, 100);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 100? should be 1000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in other parts of the code we do 100


$nodeIdsChunks = array_chunk($nodeIDs, 100);
foreach ($nodeIdsChunks as $nodeIdsChunk) {
$qb->select('*')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory you don't need to rebuild the whole query in this loop, you only need to re-set the value of the IN operator.

You can leave this as is now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, the andWhere calls will add duplicate where conditions to the query builder.

@PVince81
Copy link
Contributor

PVince81 commented Mar 7, 2017

Code looks good to me now 👍

According to @SergioBertolinSG we already have integration tests for the "oc:share-types" property so since the tests passed I think this is ready to merge.

It seems obvious to me that the performance will be significantly better and we anyway want this in 10.0. Perf testing can be done with tomorrow's daily after this is merged.

To be discussed: backporting.

@DeepDiver1975 any objections ?

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 7, 2017

General conclusion here is: if you will have questions from customers -> we have function $folderNode->getDirectoryListing() which basicaly does query to filecache and returns the nodes. For these nodes, we are quering sharing table. I considered Join with filecache here as well, because IN() clause is also kind of it, but JOIN makes no sense here because we would redo twice the work (doing "joining" twice), and $folderNode->getDirectoryListing() is necessary from what I can see. The only disadvantage here is that we have to do it "in chunks", so in case of >100 files in the folder, we need more indexed queries.

@PVince81
Copy link
Contributor

PVince81 commented Mar 7, 2017

@mrow4a I think getDirectoryListing here is out of scope for this PR. Please look into it in a separate ticket/PR.

*
* @param \OCP\Files\Node $node file node
* @param IShare[] array containing shares
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix phpdoc

*
* @return int[] array of share types
* @param int[] array of folder/file nodeIDs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix PHPDoc

@@ -462,6 +462,59 @@ public function move(\OCP\Share\IShare $share, $recipient) {
/**
* @inheritdoc
*/
public function getAllSharesBy($userId, $shareTypes, $nodeIDs, $reshares) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the interface method looks like this

Node $node vs nodeIds - please correct

	/**
	 * Get all shares by the given user for specified shareTypes array
	 *
	 * @param string $userId
	 * @param int[] $shareTypes
	 * @param Node|null $node
	 * @param bool $reshares Also get the shares where $user is the owner instead of just the shares where $user is the initiator
	 * @return \OCP\Share\IShare[]
	 * @since 10.0.0
	 */
	public function getAllSharesBy($userId, $shareTypes, $node, $reshares);

@mrow4a mrow4a force-pushed the optimize_propfind branch 3 times, most recently from c29cbfc to 274fe9b Compare March 9, 2017 22:20
@PVince81
Copy link
Contributor

Rebased.

Now Jenkins, I know it's Friday, but still please do your job!

…ries

For already obtained nodes through `$folderNode->getDirectoryListing()`, query all share-types using IN() clause, injecting nodeID predicates directly into the query

Full coverage with unit tests and fixes
@PVince81
Copy link
Contributor

Backporting would have been nice but there are interface additions which could break.
The only way to backport is to not add the new methods to the interfaces and add a "method_exists" check in the class that would call it and log an error if it doesn't, or fallback to to the old behavior.

@DeepDiver1975 @mrow4a thoughts ?

@mrow4a
Copy link
Contributor Author

mrow4a commented Mar 31, 2017

I consider backporting dark magic, but if it is possible why not :>

@phil-davis
Copy link
Contributor

Note: this code is in stable10 - there is no backport PR. I guess that the "backport" discussion above from 2017 is about backporting to v9.*, and that looks like it did not happen (which is OK).

It looks like there is no "backport to stable10" because the code was already in master at the point where the stable10 branch was created.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants