Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up rm bundles #4312

Merged
merged 41 commits into from
Dec 7, 2022
Merged

Speed up rm bundles #4312

merged 41 commits into from
Dec 7, 2022

Conversation

AndrewJGaut
Copy link
Contributor

@AndrewJGaut AndrewJGaut commented Nov 17, 2022

Reasons for making this change

Speed up rm bundles since it's very slow on the Stanford instance (and is slow in general on any instance that has lots of bundles).

Related issues

#4217

@AndrewJGaut AndrewJGaut changed the title added in some timing code for this Speed up rm bundles Nov 17, 2022
@AndrewJGaut
Copy link
Contributor Author

Speedup on dev for deleting bundles:
Original code: ~2.48 seconds
Updated code: ~0.243 seconds

@AndrewJGaut
Copy link
Contributor Author

Note: the reason it was so slow prior is because it would recompute the user disk in entirety each time a bundle was removed (_get_disk_used(user_id) was called, which searches for all bundles owned by the user). To alleviate this, we now subtract the sum of the sizes of the deleted bundles from the user disk used instead, which is much faster.

@AndrewJGaut AndrewJGaut marked this pull request as ready for review November 25, 2022 05:59
@@ -2708,6 +2708,14 @@ def increment_user_time_used(self, user_id, amount):
user_info['time_used'] += amount
self.update_user_info(user_info)

def increment_user_disk_used(self, user_id, amount):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add type hints.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

# OK, now let's add our change.
bundle_data_sizes = local.model.get_bundle_metadata(relevant_uuids, 'data_size')
local.model.increment_user_disk_used(
request.user.user_id, (-1) * sum(map(int, bundle_data_sizes.values()))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just do -sum...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modified this to use something different anyways

local.model.update_user_disk_used(request.user.user_id)
# Just decrement the user disk used by the sum of sizes of bundles deleted
# OK, now let's add our change.
bundle_data_sizes = local.model.get_bundle_metadata(relevant_uuids, 'data_size')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this diff produce the same answer as the above? Just wondering if there are subtle differences?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It did not, it turns out, due to the following edge cases:

  • cl rm -d: Here, the bundle contents are removed but not the bundle metadata. For a given bundle B, we must take care to decrement the user disk quota when cl rm -d B is run, but to not decrement it again when cl rm B is run (since that second cl rm just deletes the metadata and not anything from disk).
  • Files uploaded with symlinks: These shouldn't affect user disk quota regardless, so when we remove them there should be no decrement to the user disk quota.

The new changes fix this.

@@ -1314,7 +1315,12 @@ def delete_bundles(uuids, force, recursive, data_only, dry_run):
local.model.delete_bundles(relevant_uuids)

# Update user statistics
local.model.update_user_disk_used(request.user.user_id)
# Just decrement the user disk used by the sum of sizes of bundles deleted
# OK, now let's add our change.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete this comment line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, very nice! Wonder if we could add a test for this.

local.model.update_user_disk_used(request.user.user_id)
# Just decrement the user disk used by the sum of sizes of bundles deleted
# OK, now let's add our change.
bundle_data_sizes = local.model.get_bundle_metadata(relevant_uuids, 'data_size')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also check the case where some bundles that are being deleted don't have data (e.g., a bundle which has had cl rm -d ... done on it)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked that case with the new tests.

@@ -310,20 +310,24 @@ def remove(path):

if not FileSystems.exists(path):
FileSystems.delete([path])
return
return True # not sure about this one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return True look good to me

elif os.path.isdir(path):
try:
shutil.rmtree(path)
return True
except shutil.Error:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is an error, it will have no return value here? Should we use finally here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I added a return False statement at the end of the function to catch any other cases.

@AndrewJGaut
Copy link
Contributor Author

I made some changes to catch edge cases.

Previously, there were no test cases that I saw for verifying that disk quota was incremented and decremented properly upon file upload and deletion. So, I added test cases to check that disk quota is incremented and decremented properly upon:

  • normal file upload and deletion
  • normal file upload and deletion with rm -d before deletion with rm
  • symlinked file upload and deletion

This helps to verify that my rm modification still runs correctly and to verify that our code handles disk quota usage correctly in various cases.

@AndrewJGaut
Copy link
Contributor Author

Now ready for final review @percyliang @wwwjn

Thanks for the reviews!

@AndrewJGaut AndrewJGaut merged commit e6e1e0e into master Dec 7, 2022
@AndrewJGaut AndrewJGaut deleted the fix/4217-speed-up-rm-bundle branch December 7, 2022 20:56
@AndrewJGaut AndrewJGaut mentioned this pull request Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants