[core] add tmpfs mounting support in rayci #42247

Merged: 1 commit, Jan 9, 2024

@aslonnie aslonnie commented Jan 9, 2024

adds tmpfs mounting support in rayci, and runs the out of disk test with it.

newer versions of docker use a newer version of overlayfs, which calculates free disk space differently and makes the test fail. using tmpfs makes the test pass again.
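
To illustrate why the backing filesystem matters here: an out-of-disk check usually boils down to comparing the used and total bytes that the filesystem reports for a path, so a filesystem that accounts for space differently can flip the result. The following is a minimal sketch of such a check, not Ray's actual implementation. The helper names and the 95% threshold are assumptions made up for illustration.

```python
import shutil
import tempfile


def disk_used_fraction(path: str) -> float:
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        # Some pseudo-filesystems report zero capacity; treat as empty.
        return 0.0
    return usage.used / usage.total


def is_over_capacity(path: str, threshold: float = 0.95) -> bool:
    """Hypothetical check: True when usage exceeds `threshold` (assumed value)."""
    return disk_used_fraction(path) > threshold


if __name__ == "__main__":
    tmp = tempfile.gettempdir()
    print(f"{tmp}: {disk_used_fraction(tmp):.1%} used")
```

Pointing a check like this at a tmpfs mount with a fixed size gives the test a small filesystem whose capacity and usage are reported predictably, which is presumably why the tmpfs mount makes the test deterministic again.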

@aslonnie aslonnie marked this pull request as draft January 9, 2024 00:47
@aslonnie aslonnie force-pushed the lonnie-0108-outofspace branch 2 times, most recently from 5b21cca to dd8698e Compare January 9, 2024 04:06
@aslonnie aslonnie changed the title out of disk space test debug on new stack [core] add tmpfs mounting support in rayci Jan 9, 2024
@aslonnie aslonnie marked this pull request as ready for review January 9, 2024 04:11
@aslonnie (Collaborator, Author) commented Jan 9, 2024

interestingly, it seems that it will make test_ood_events flaky somehow..

@aslonnie (Collaborator, Author) commented Jan 9, 2024

> interestingly, it seems that it will make test_ood_events flaky somehow..

my guess is that there is an event queue or something: list_cluster_events is async, and it may take some time before all the events appear. I am adding a 5 second wait as an attempt to deflake. it is not the right fix though.

@can-anyscale (Collaborator)

Is there any downside to using tmpfs universally for all Linux tests instead of having it as an option?

@@ -75,6 +75,16 @@ steps:
--test_tag_filters=mem_pressure -- //python/ray/tests/...
job_env: corebuild

- label: ":ray: core: out of disk tests"
Collaborator

missing the redis version of the test?

Collaborator Author

added

@rickyyx (Contributor) left a comment

[GIF via GIPHY]

@@ -247,6 +247,10 @@ def foo():
    except ray.exceptions.RayTaskError as e:
        assert isinstance(e.cause, ray.exceptions.OutOfDiskError)

    # Give it some time for events to appear.
Contributor

I think we can wrap the validation logic in a wait_for_condition like this:

    def verify():
        events = list_cluster_events()
        print(events)
        assert len(events) == 1
        assert (
            "Error: No available node types can fulfill " "resource request"
        ) in events[0]["message"]
        return True

    wait_for_condition(verify)
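
For readers unfamiliar with this pattern, here is a minimal sketch of what a polling helper of this kind generally does. It is a hypothetical stand-in written for illustration, not Ray's `wait_for_condition`: the name `wait_until`, its parameters, and its defaults are all assumptions. The predicate is retried until it returns True, and any exception it raises (including a failed `assert`) is treated as "not ready yet".

```python
import time
from typing import Callable, Optional


def wait_until(
    predicate: Callable[[], bool],
    timeout_s: float = 10.0,
    interval_s: float = 0.1,
) -> None:
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper: exceptions raised by the predicate, including
    AssertionError, are swallowed and the predicate is retried.
    """
    deadline = time.monotonic() + timeout_s
    last_error: Optional[BaseException] = None
    while time.monotonic() < deadline:
        try:
            if predicate():
                return
        except Exception as e:
            last_error = e  # remember the most recent failure for the report
        time.sleep(interval_s)
    raise TimeoutError(f"condition not met within {timeout_s}s") from last_error
```

With a helper like this, a `verify` function built from assertions keeps being retried until every assertion passes or the timeout is hit, so the test waits only as long as it needs to rather than for a fixed sleep.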

Collaborator Author

that feels worse. it mixes checking the result with waiting for the result to come back, and treats wrong results as acceptable while it retries.

waiting for 5 secs is not too bad here for now. will leave it to the core team to refactor down the road.
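
The objection above can be illustrated with a sketch that keeps the two concerns apart: first wait only until events have arrived, then validate them exactly once, so a wrong result fails immediately instead of being retried. This is not code from the PR. `fetch_events` is a hypothetical stand-in for `list_cluster_events`, and the helper names and timeouts are assumptions.

```python
import time
from typing import Callable, Dict, List

Event = Dict[str, str]


def wait_for_events(
    fetch_events: Callable[[], List[Event]],
    timeout_s: float = 5.0,
    interval_s: float = 0.1,
) -> List[Event]:
    """Wait only for arrival: return as soon as any events are visible."""
    deadline = time.monotonic() + timeout_s
    while True:
        events = fetch_events()
        if events:
            return events
        if time.monotonic() >= deadline:
            raise TimeoutError("no events arrived before the deadline")
        time.sleep(interval_s)


def check_events(events: List[Event], expected_substring: str) -> None:
    """Validate once; a wrong result fails immediately, with no retry."""
    assert len(events) == 1, events
    assert expected_substring in events[0]["message"], events[0]
```

One caveat of this sketch: it returns as soon as the first events show up, so if several events are expected and they arrive at different times, it can still observe a partial set. That limitation is part of why a fixed wait was kept as the simpler stopgap.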

Collaborator Author

this is changed to wait for 2 seconds now fwiw.

@aslonnie (Collaborator, Author) commented Jan 9, 2024

> Is there any downside to using tmpfs universally for all Linux tests instead of having it as an option?

in theory, no, but it does not work.. many more tests will fail, and I am not really interested in investigating and fixing all of those.

@can-anyscale (Collaborator) left a comment

@aslonnie (Collaborator, Author) commented Jan 9, 2024

the redis test fails somehow :(

@aslonnie (Collaborator, Author) commented Jan 9, 2024

changed test size, and decreased the wait. seems that redis tests run slower.

@aslonnie aslonnie merged commit 351b401 into master Jan 9, 2024
9 checks passed
@aslonnie aslonnie deleted the lonnie-0108-outofspace branch January 9, 2024 22:00
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024
aslonnie added a commit that referenced this pull request Jan 26, 2024
cherrypick #42247

pure test code change. no code change.

Signed-off-by: Lonnie Liu <[email protected]>

4 participants