[core] removing placement groups sometimes throws a SystemExit error #13487

Closed
2 tasks done
krfricke opened this issue Jan 15, 2021 · 12 comments

@krfricke
Contributor

What is the problem?

Latest master.

When a placement group that is used by an actor is removed, the actor sometimes fails with a SystemExit error. This started occurring after placement groups (PGs) were introduced to Ray Tune (#13370).

I'm not sure if this is a bug or a usage error. It only comes up sometimes, not all the time.

Generally, it would be great to be able to disable SystemExit error messages when removing placement groups. The same might be true for deliberately terminating actors.
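
As a rough illustration of the deliberate-termination case, here is a minimal sketch that uses plain threads instead of Ray actors. The `Trainer` class and its `timeout` parameter are assumptions made for this example and are not Ray APIs. The idea is that `stop()` blocks until the training thread has exited, so the caller only tears down resources afterwards.

```python
import threading


class Trainer:
    """Hypothetical stand-in for the actor; plain threads, no Ray."""

    def __init__(self):
        self._sem = threading.Semaphore(0)
        self._stop_event = threading.Event()
        self._thread = None
        self.steps = 0

    def train(self):
        def _loop():
            while True:
                # Wait until cont() or stop() releases the semaphore.
                self._sem.acquire()
                if self._stop_event.is_set():
                    return
                self.steps += 1

        self._thread = threading.Thread(target=_loop, daemon=True)
        self._thread.start()

    def cont(self):
        self._sem.release()

    def stop(self, timeout=5.0):
        # Signal the loop, wake it up, then wait for it to finish.
        self._stop_event.set()
        self._sem.release()
        self._thread.join(timeout)
        return not self._thread.is_alive()


trainer = Trainer()
trainer.train()
trainer.cont()
trainer.cont()
stopped = trainer.stop()  # returns only after the thread has exited
print("stopped cleanly:", stopped)
```

With Ray, the analogous ordering would presumably be to wait on the stop call with `ray.get(...)` before calling `remove_placement_group(pg)`. Whether that avoids the SystemExit message has not been verified here.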

Reproduction (REQUIRED)

The repro script is non-deterministic. In the last 10 runs, it failed 5 times (and did not throw an error in the other 5 runs).

The repro script contains a much simplified version of the tune training loop.

import time
from threading import Semaphore, Event, Thread

import ray
from ray.util import placement_group, remove_placement_group

ray.init(num_cpus=2)


@ray.remote
class Actor:
    def __init__(self):
        self.sem = Semaphore()
        self.stop = Event()
        self.thread = None

    def train(self):
        def _train_thread():
            while True:
                self.sem.acquire()
                if self.stop.is_set():
                    print("Stop in 3 seconds")
                    time.sleep(3)
                    return
                print("Train")
        self.thread = Thread(target=_train_thread)
        self.thread.setDaemon(True)
        self.thread.start()

    def cont(self):
        self.sem.release()

    def stop(self):
        self.stop.set()
        self.sem.release()


pg = placement_group([{"CPU": 1}])
actor = Actor.options(placement_group=pg).remote()
ray.get(actor.train.remote())

actor.cont.remote()
actor.cont.remote()

actor.stop.remote()
# actor.__ray_terminate__.remote()
remove_placement_group(pg)

time.sleep(3)

/Users/kai/.pyenv/versions/3.7.7/bin/python /Users/kai/coding/sandbox/tune_pg_raw.py
2021-01-15 11:20:59,127	INFO services.py:1174 -- View the Ray dashboard at http://127.0.0.1:8265
(pid=60623) Train
(pid=60623) Train
(pid=60623) Train
(pid=60623) Stop in 3 seconds
(pid=60623) 2021-01-15 11:21:01,234	ERROR worker.py:390 -- SystemExit was raised from the worker
(pid=60623) Traceback (most recent call last):
(pid=60623)   File "python/ray/_raylet.pyx", line 570, in ray._raylet.task_execution_handler
(pid=60623)   File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
(pid=60623)   File "python/ray/_raylet.pyx", line 509, in ray._raylet.execute_task
(pid=60623)   File "python/ray/_raylet.pyx", line 510, in ray._raylet.execute_task
(pid=60623)   File "python/ray/_raylet.pyx", line 1458, in ray._raylet.CoreWorker.store_task_outputs
(pid=60623)   File "/Users/kai/coding/ray/python/ray/worker.py", line 176, in get_serialization_context
(pid=60623)     def get_serialization_context(self, job_id=None):
(pid=60623)   File "/Users/kai/coding/ray/python/ray/worker.py", line 387, in sigterm_handler
(pid=60623)     sys.exit(1)
(pid=60623) SystemExit: 1
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
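
The traceback above ends in `sigterm_handler` calling `sys.exit(1)` while the worker was in the middle of unrelated code. Below is a minimal, Ray-free sketch of that mechanism, assuming a POSIX system and that it runs in the main thread. `busy_work` and `run_until_sigterm` are hypothetical names. It only illustrates how a Python signal handler can raise SystemExit inside whatever frame happens to be executing, not how Ray delivers the signal.

```python
import os
import signal
import sys
import time


def sigterm_handler(signum, frame):
    # Same shape as the handler in the traceback: exit from inside the handler.
    sys.exit(1)


def busy_work(deadline):
    # Stand-in for whatever the worker is doing when the signal arrives.
    while time.monotonic() < deadline:
        time.sleep(0.01)


def run_until_sigterm():
    previous = signal.signal(signal.SIGTERM, sigterm_handler)
    try:
        # Send SIGTERM to this process; the handler runs in the main thread.
        os.kill(os.getpid(), signal.SIGTERM)
        busy_work(time.monotonic() + 2.0)
    except SystemExit as exc:
        # SystemExit surfaces in the interrupted code, not at a fixed place.
        return exc.code
    finally:
        signal.signal(signal.SIGTERM, previous)
    return None


exit_code = run_until_sigterm()
print("SystemExit code:", exit_code)
```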

cc @rkooo567

@krfricke krfricke added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) core labels Jan 15, 2021
@rkooo567 rkooo567 self-assigned this Jan 15, 2021
@rkooo567 rkooo567 added P2 Important issue, but not time-critical P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) P2 Important issue, but not time-critical labels Jan 16, 2021
@rkooo567
Contributor

Set P1 until identifying the root cause.

@ericl ericl added this to the Core Bugs milestone Jan 20, 2021
@ericl ericl removed the core label Jan 20, 2021
@rkooo567
Contributor

cc @oliverhu Do you have some time to take a look at this?

@oliverhu
Member

a bit hectic recently, is this more important than the "possible unhandled error" one?

@rkooo567
Contributor

I don't think so (probably similar). If you are busy, it is totally fine! I can take a look at it later.

@oliverhu
Member

❤️ thanks!

@clay4megtr
Contributor

Hi @rkooo567, is this in progress? I found that this issue still exists. Can I take it over?

@clay4megtr clay4megtr self-assigned this Feb 2, 2021
@rkooo567 rkooo567 removed their assignment Feb 2, 2021
@oliverhu
Member

oliverhu commented Feb 4, 2021

actually isn't this issue the same as the problem we tried to solve in #13140 @rkooo567 ?..

@rkooo567
Contributor

rkooo567 commented Feb 7, 2021

Hmm, possibly? I am not 100% sure if it is the same issue.

@oliverhu
Member

oliverhu commented Feb 8, 2021

I think it is similar: currently PG removal leads to a SystemExit error, and we want a new type of error to make the message more explicit to the customers. 🤔 @clay4444 do you have any update on this?

@ericl ericl added P2 Important issue, but not time-critical and removed P1 Issue that should be fixed within a few weeks labels Feb 22, 2021
@rkooo567
Contributor

This is actually expected behavior, at least right now. We basically kill the actors that are associated with the placement group if the PG is removed.

@krfricke what's the expected behavior you'd like to see here? Would you 1. keep the actor alive or 2. raise a different error like PlacementGroupRemovedError?
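
To make option 2 concrete, here is a minimal sketch of what a dedicated error could look like. Everything in it is hypothetical: `PlacementGroupRemovedError` is only the name proposed above, and `explain_exit` and the `"pg-1"` identifier are assumptions made for illustration, not existing Ray APIs.

```python
class PlacementGroupRemovedError(RuntimeError):
    """Hypothetical error type; only the name comes from this thread."""

    def __init__(self, placement_group_id):
        super().__init__(
            f"Actor was stopped because its placement group "
            f"{placement_group_id} was removed."
        )
        self.placement_group_id = placement_group_id


def explain_exit(removed_pg_id=None):
    """Hypothetical helper: pick the error to surface when a worker stops."""
    if removed_pg_id is not None:
        # The worker was stopped on purpose, so say why.
        return PlacementGroupRemovedError(removed_pg_id)
    # Otherwise fall back to the generic exit seen in the traceback.
    return SystemExit(1)


err = explain_exit("pg-1")
print(type(err).__name__, "-", err)
```

The point of the sketch is only that the reason for the shutdown travels with the error, so the message can tell the user the exit was intentional.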

@rkooo567
Contributor

rkooo567 commented Nov 2, 2021

Duplicate of #10232

@rkooo567 rkooo567 closed this as completed Nov 2, 2021
@bhavik66

bhavik66 commented Jan 7, 2022

You can try deleting the folder

tmp/ray/<your session>/logs
