Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Stress Tests #4406

Merged
merged 19 commits into from
Mar 2, 2023
Merged

Fix Stress Tests #4406

merged 19 commits into from
Mar 2, 2023

Conversation

AndrewJGaut
Copy link
Contributor

@AndrewJGaut AndrewJGaut commented Feb 27, 2023

This addresses: #4404

This PR does several things to alleviate issues brought to light by stress tests:

  • Reset ITERATIONS_PER_DISK_CHECK to be 2000 rather than 1. When it was 1, upload speed was significantly slower since the client was sending a request to server to update user_disk_used for every 16KB of data uploaded. With it now set to 2000, the user_disk_used will be updated every 32MB (and will be updated with the initial 16KB on the first iteration of the upload). Upload speeds are now approximately in line with previous speeds according to benchmarking tests.
  • Fix an SQL deadlock issue. When user_disk_used was incremented, we were selecting the user disk used in one statement, incrementing it, and then updating it in another statement. This was not atomic and was leading to an SQL deadlock and possible race conditions. To alleviate this, we now use select_for_update rather than just select when getting user_disk_used.
  • Fix bundle finalization issue. Stress tests were also taking significantly longer because bundles were staying in the "finalized" state for a long time. It turns out that the socket timeout set in the bundle_manager for the finalization message sent to workers was too low, and so we were frequently getting a socket timeout error for that message and the bundle_manager was taking a long time to progress to actual finalization of the bundle.
  • Update stress tests: Stress tests now include timing for each individual stress test as well as overall timing. This gives us a better handle for which tests are taking up the majority of the time.

@AndrewJGaut AndrewJGaut changed the title test draft Fix Stress Tests Feb 28, 2023
@AndrewJGaut AndrewJGaut marked this pull request as ready for review February 28, 2023 19:16
@AndrewJGaut
Copy link
Contributor Author

I should note that stress tests are now finishing in around 4 hours as they were previously and there were no SQL deadlock errors that I saw. Once this PR is in, therefore, we should finally be able to deploy again!

@codalab codalab deleted a comment from mergify bot Feb 28, 2023
@@ -38,34 +46,38 @@ class TestFile:

def __init__(self, file_name, size_mb=1, content=None):
self._file_name = file_name
self._file_path = temp_path(file_name)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should note this oddity: suddenly, when running stress tests, I started getting an Errno 13 Permission Denied error when the stress tests would try to create a file. To get around this, instead of creating the file in the working directory, I create it /tmp/ and then update the rest of the code accordingly.

Not sure why this issue just started popping up, though.

Copy link
Member

@epicfaace epicfaace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

"""
User used some time.
Increment disk_used for user by amount.
When incrementing values, we have to use a special query to ensure that we avoid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we shorten this docstring? probably a bit too much detail for posterity

time.sleep(
0.3
0.1
) # changed from 0.003 to keep from rate-limiting due to dead workers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) # changed from 0.003 to keep from rate-limiting due to dead workers
)

@@ -179,3 +181,48 @@ def cleanup(cl, tag, should_wait=True):
run_command([cl, 'wrm', uuid, '--force'])
worksheets_removed += 1
print('Removed {} bundles and {} worksheets.'.format(bundles_removed, worksheets_removed))


class timer:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uppercase?

Copy link
Member

@epicfaace epicfaace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@@ -393,6 +391,7 @@ def main():
runner.run()
duration_seconds = time.time() - start_time
print("--- Completion Time: {} minutes---".format(duration_seconds / 60))
print(json.dumps(runner._runs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's print this before we print completion time.

@AndrewJGaut AndrewJGaut merged commit 7833e8e into master Mar 2, 2023
@AndrewJGaut AndrewJGaut deleted the fix-deadlock2 branch March 2, 2023 04:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants