Calling ray.init() with too much object store memory causes object store to crash on Linux. #3670

robertnishihara · 2019-01-01T00:05:28Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu
Ray installed from (source or binary): source
Ray version: 0.6.1
Python version: 3.6.3

Starting the object store with 50GB on an m5.4xlarge instance

In [1]: import ray
In [2]: ray.init(object_store_memory_mb=50*1000)
WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Process STDOUT and STDERR is being redirected to /tmp/ray/session_2018-12-31_23-58-25_28453/logs.
Waiting for redis server at 127.0.0.1:14322 to respond...
Waiting for redis server at 127.0.0.1:24609 to respond...
WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 33008.238592MB available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
Starting the Plasma object store with 50.0GB memory using /tmp.
E1231 23:58:25.641736 28607 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 50 more times

This doesn't immediately fail because Ray thinks that the machine has 66GB of memory:

In [3]: ray.utils.get_system_memory_bytes() // 10**9
Out[3]: 66

However, the plasma store fails to start because the memory check in Arrow finds only about 42GB available. Note that I'm passing in -d /tmp.

~$ ray/build/external/arrow-install/bin/plasma_store_server -s /tmp/store -m 50000000000 -d /tmp
I0101 00:02:24.788489 28722 store.cc:994] Allowing the Plasma store to use up to 50GB of memory.
I0101 00:02:24.788723 28722 store.cc:1024] Starting object store with directory /tmp and huge page support disabled
F0101 00:02:24.788743 28722 store.cc:1039] System memory request exceeds memory available in /tmp. The request is for 50000000000 bytes, and the amount available is 42490683392 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
*** Check failure stack trace: ***
    @           0x44212c  google::LogMessage::Fail()
    @           0x442070  google::LogMessage::SendToLog()
    @           0x4419b2  google::LogMessage::Flush()
    @           0x4417ad  google::LogMessage::~LogMessage()
    @           0x43e5e0  arrow::util::ArrowLog::~ArrowLog()
    @           0x415b04  main
    @     0x7f5039fa4830  __libc_start_main
    @           0x415f09  _start
    @              (nil)  (unknown)
Aborted (core dumped)

The relevant code for checking memory in the plasma store is https://github.com/apache/arrow/blob/71ccba9b217a7af922d8a69be21ed4db205af741/cpp/src/plasma/store.cc#L1028-L1037. The issue may be that we're checking shared memory size instead of regular memory.
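To make the suspected mismatch concrete, here is a minimal sketch that compares the two quantities side by side. It assumes the store's check measures free space on the filesystem backing the `-d` directory (which is what the "memory available in /tmp" message suggests), while the Python side looks at total physical RAM. The function names are hypothetical and are not part of Ray or Arrow.

```python
import os
import shutil


def total_system_memory_bytes():
    """Total physical RAM in bytes, read via sysconf (Linux)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def free_bytes_in_directory(path):
    """Free bytes on the filesystem that backs `path`."""
    return shutil.disk_usage(path).free


def compare_limits(requested_bytes, directory):
    """Report whether a request passes each of the two checks.

    Hypothetical helper: it only illustrates how the two checks
    can disagree for the same request.
    """
    return {
        "fits_in_ram": requested_bytes <= total_system_memory_bytes(),
        "fits_in_directory": requested_bytes
        <= free_bytes_in_directory(directory),
    }
```

On the machine above, `compare_limits(50 * 10**9, "/tmp")` would presumably report that the request fits in RAM but not in the directory, which matches the observed behavior.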

Note that the actual failure raised by ray.init is

E1231 23:58:30.547828 28607 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 1 more times
E1231 23:58:30.549942 28453 io.cc:167] Connection to IPC socket failed for pathname /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store, retrying 1 more times
F1231 23:58:30.647966 28607 object_store_notification_manager.cc:22]  Check failed: _s.ok() Bad status: IOError: Could not connect to socket /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store
*** Check failure stack trace: ***
    @           0x5d7cd0  google::LogMessage::Fail()
    @           0x5d7c14  google::LogMessage::SendToLog()
    @           0x5d7556  google::LogMessage::Flush()
    @           0x5d7351  google::LogMessage::~LogMessage()
    @           0x5c5c50  arrow::util::ArrowLog::~ArrowLog()
    @           0x5770e7  ray::ObjectStoreNotificationManager::ObjectStoreNotificationManager()
    @           0x526cfa  ray::ObjectManager::ObjectManager()
    @           0x4c0e67  ray::raylet::Raylet::Raylet()
    @           0x4ae3c9  main
    @     0x7fc2edc6d830  __libc_start_main
    @           0x4b3d19  _start
    @              (nil)  (unknown)
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-2-f09714301df8> in <module>()
----> 1 ray.init(object_store_memory_mb=50*1000)

~/ray/python/ray/worker.py in init(redis_address, num_cpus, num_gpus, resources, object_store_memory, object_store_memory_mb, redis_max_memory, redis_max_memory_mb, collect_profiling_data, node_ip_address, object_id_seed, num_workers, local_mode, driver_mode, redirect_worker_output, redirect_output, ignore_reinit_error, num_redis_shards, redis_max_clients, redis_password, plasma_directory, huge_pages, include_webui, driver_id, configure_logging, logging_level, logging_format, plasma_store_socket_name, raylet_socket_name, temp_dir, _internal_config, use_raylet)
   1619         _internal_config=_internal_config,
   1620     )
-> 1621     ret = _init(ray_params, driver_id=driver_id)
   1622     for hook in _post_init_hooks:
   1623         hook()

~/ray/python/ray/worker.py in _init(ray_params, driver_id)
   1428         mode=ray_params.driver_mode,
   1429         worker=global_worker,
-> 1430         driver_id=driver_id)
   1431     return ray_params.address_info
   1432 

~/ray/python/ray/worker.py in connect(ray_params, info, mode, worker, driver_id)
   1975     # Create an object store client.
   1976     worker.plasma_client = thread_safe_client(
-> 1977         plasma.connect(info["store_socket_name"]))
   1978 
   1979     raylet_socket = info["raylet_socket_name"]

~/ray/python/ray/pyarrow_files/pyarrow/_plasma.pyx in pyarrow._plasma.connect()

~/ray/python/ray/pyarrow_files/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Could not connect to socket /tmp/ray/session_2018-12-31_23-58-25_28453/sockets/plasma_store


npyoung · 2019-01-11T20:55:44Z

I'm getting this as well. Ray starts and mentions it's using /tmp. Driver then cannot find the ray process when calling ray.init().

robertnishihara · 2019-03-07T20:49:09Z

See modin-project/modin#468 for a concrete instantiation of this problem.

pcmoritz · 2019-03-07T21:16:19Z

I'm contemplating removing this check from plasma altogether. This could bring back the dreaded "BUS ERROR" problem, but if that's the case we can do tighter checking on the Python side before the plasma store is started and give a better error message. It's certainly bad to have two different and inconsistent checks here.

pcmoritz · 2019-03-07T21:23:17Z

A safer method is to keep the check in plasma and start the store with as much memory as is available, and then get that amount through the python client and issue a warning/error if it is not enough.
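A rough sketch of the clamp-and-warn logic described above, written purely on the Python side for illustration. In the actual proposal the store would do the clamping and the Python client would query the granted amount; here `clamp_object_store_memory` and its `available_bytes` parameter are hypothetical names, and free space in the plasma directory is assumed to be the limiting quantity.

```python
import shutil
import warnings


def clamp_object_store_memory(requested_bytes, plasma_directory,
                              available_bytes=None):
    """Return the amount of memory the store could actually be given.

    Hypothetical helper, not a Ray or Arrow API. If `available_bytes`
    is not supplied, it is taken to be the free space on the
    filesystem backing `plasma_directory` (an assumption).
    """
    if available_bytes is None:
        available_bytes = shutil.disk_usage(plasma_directory).free
    granted = min(requested_bytes, available_bytes)
    if granted < requested_bytes:
        # Warn instead of crashing, so the caller sees a clear message.
        warnings.warn(
            "Requested {} bytes for the object store, but only {} bytes "
            "are available in {}; using {} bytes.".format(
                requested_bytes, available_bytes, plasma_directory, granted
            )
        )
    return granted
```

With the numbers from the report, a 50000000000-byte request against 42490683392 available bytes would be granted 42490683392 bytes and would emit a warning rather than aborting the store.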

edoakes · 2020-03-05T23:21:30Z

Stale - please open new issue if still relevant

crypdick · 2020-09-09T01:03:53Z

@edoakes I just experienced this today. I was getting errors when placing objects into the shared object store, and the error message recommended setting ray.init(object_store_memory=<bytes>). I set 500GB of the 768GB of RAM on an r5.metal for object storage and was scratching my head over why I was getting these errors.

robertnishihara added the bug Something that is supposed to be working; but isn't label Jan 1, 2019

robertnishihara mentioned this issue Jan 28, 2019

ARROW-4296: [Plasma] Use one mmap file by default, prevent crash with -f apache/arrow#3490

Closed

robertnishihara mentioned this issue Mar 7, 2019

Error importing modin on Linux: Connection to IPC socket failed for pathname /tmp/ray/session... modin-project/modin#468

Closed

derasaria mentioned this issue Jun 5, 2019

Unable to get ray.init() to work on a private cluster #4936

Closed

edoakes closed this as completed Mar 5, 2020

rkooo567 mentioned this issue Apr 8, 2020

ray.init() Memory Error - Could not connect to socket #7932

Closed

asfimport mentioned this issue Mar 7, 2019

[Plasma] Avoid store crash if not enough memory is available apache/arrow#21316

Closed
