Calling ray.init() with too much object store memory causes object store to crash on Linux. #3670
Comments
I'm getting this as well. Ray starts and mentions it's using
See modin-project/modin#468 for a concrete instantiation of this problem.
I'm contemplating removing this check from plasma altogether. This could bring back the dreaded "BUS ERROR" problem, but if that's the case we can do tighter checking on the Python side before the plasma store is started and give a better error message. It's certainly bad to have two different and inconsistent checks here.
A safer method is to keep the check in plasma and start the store with as much memory as is available, then get that amount through the Python client and issue a warning or error if it is not enough.
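The approach suggested in the comment above could look roughly like the following on the Python side. This is only a sketch under stated assumptions: `reconcile_store_memory`, `granted_bytes`, and `min_fraction` are hypothetical names, and how the granted amount would actually be read back from the plasma store is not specified in this thread.

```python
import warnings


def reconcile_store_memory(requested_bytes, granted_bytes, min_fraction=0.5):
    """Compare what the user asked for with what the store actually got.

    `granted_bytes` stands in for a value reported by the store (hypothetical).
    `min_fraction` is an illustrative threshold below which we raise instead
    of warning; it is an assumption, not a value taken from Ray.
    """
    if granted_bytes >= requested_bytes:
        return granted_bytes
    if granted_bytes < min_fraction * requested_bytes:
        raise MemoryError(
            "Object store was granted {} bytes, far less than the {} bytes "
            "requested.".format(granted_bytes, requested_bytes)
        )
    warnings.warn(
        "Object store was granted {} bytes instead of the {} bytes "
        "requested.".format(granted_bytes, requested_bytes)
    )
    return granted_bytes
```

With the numbers from this issue, a 50GB request against a 42GB grant would produce a warning rather than a crash, while a much smaller grant would raise a clear error.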
Stale - please open a new issue if still relevant
@edoakes just experienced this today. I was getting errors when placing objects into the shared object store, which recommended setting
System information
Starting the object store with 50GB on an m5.4xlarge instance doesn't immediately fail because we think that the machine has 66GB of memory.
However, the plasma store fails to start because the memory check in Arrow appears to conclude that we only have 42GB. Note that I'm passing in -d /tmp.
The relevant code for checking memory in the plasma store is https://github.com/apache/arrow/blob/71ccba9b217a7af922d8a69be21ed4db205af741/cpp/src/plasma/store.cc#L1028-L1037. The issue may be that we're checking shared memory size instead of regular memory.
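To see whether the two numbers in this report could come from measuring different things, one can compare total physical memory against the space available on the filesystem backing the store. The following is a minimal Python sketch for Linux, not the code that Ray or Arrow uses. The function names are hypothetical, and inspecting /dev/shm and /tmp is an assumption based on the -d /tmp flag mentioned above.

```python
import os


def total_physical_memory_bytes():
    """Total RAM as reported by the OS (page size times number of pages)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def filesystem_available_bytes(path):
    """Space available to unprivileged users on the filesystem holding `path`."""
    stats = os.statvfs(path)
    return stats.f_frsize * stats.f_bavail


if __name__ == "__main__":
    print("physical memory:", total_physical_memory_bytes())
    # Assumed candidate directories for the store's backing files.
    for directory in ("/dev/shm", "/tmp"):
        if os.path.isdir(directory):
            print(directory, "available:", filesystem_available_bytes(directory))
```

If the second number is noticeably smaller than the first, a check based on the filesystem would reject a request that a check based on physical memory accepts, which would match the inconsistency described here.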
Note that the actual failure raised by ray.init is