Memory-aware task scheduling to avoid OOMs under memory pressure #20495
Comments
This would be a killer feature, especially for dask-on-ray 👍 |
Is this only for normal tasks and actor creation tasks? What happens for actor tasks? |
There should be:
Side question: |
Some thoughts:
Preemption Priority Level
One might also want to have a preemption priority level, perhaps with 5 (or 3, or 4) levels.
We can also denote the levels by labels:
Also, I prefer this API, which very intuitively expresses the notion of "preemptibility":
@ray.remote(preemptible=0)  # equivalent to `preemptible = "false"/"never"`; this is the default level
def my_task():
    ...
@ray.remote(preemptible=4)  # equivalent to `preemptible = "always"`
def my_other_task():
    ...
Alternatives considered (not recommended):
@ray.remote(preemptible={"level": 4})  # equivalent to `preemptible = true`
def my_task():
    ...
Alternatively, one can reverse the levels and call it "priority", which stands for "scheduling priority":
@ray.remote(preemptible={"priority": 0})  # equivalent to `preemptible = true`
def my_task():
    ...
@ray.remote(preemptible={"priority": "MAX"})  # equivalent to `preemptible = false`
def my_other_task():
    ...
Additional Considerations For Short-Lived Bursty Tasks, Polling Intervals, and Memory Bandwidth
I think we need a combination of preemption (for long-running tasks/actors) and scheduling back-pressure (for short-lived tasks, which can burst memory, CPU usage, etc.). Profiling short-lived tasks, as described below, could be difficult, though, especially for very fine-grained tasks.
Based on the polling interval granularity, the thing we would want to avoid is ping-ponging of resource usage: a raylet schedules too many tasks, then cuts back on scheduling new tasks due to preemption/backpressure, which results in low resource usage in the next interval, which in turn results in ping-ponging back to scheduling too many. Since Ray's task granularity is sub-millisecond, 100ms, while reasonable-sounding, might not work for certain workloads. How realistic these scenarios are should be investigated.
So the parameters to consider are how much to preempt, and also whether to use profiling history to gain an "average view" of node resource usage, which could deal with short-lived, bursty tasks, or to apply backpressure based on time-windowed peak/p95 usage instead of on point-in-time resource usage.
Making this Consideration more Concrete
For some context, consider the fact that a Threadripper 3990x has ~100GiB/s of memory bandwidth, which can grow/shrink memory usage by roughly 10GiB per 100ms. Datacenter chipsets may have even higher memory bandwidth. This suggests that something on the order of 100ms is a reasonable interval at which to poll resource usage. To summarize, the relevant metric to be aware of is
Note that asking the OS to allocate new memory is typically slower than asking the CPU to touch the same memory. (For reference: https://lemire.me/blog/2020/01/17/allocating-large-blocks-of-memory-bare-metal-c-speeds/). However, I don’t know if there are pathways which are more rapid, for instance, directly MMAP-ing a large number of pages into the process memory... (although from my vague understanding it is actually a costly process… and in addition the kernel touches all of the memory anyway by zero-filling it.)
Session-Based Profiling Instead of Preempting
As an extension, I think one could also have a profiling-based memory-aware scheduler that groups memory usage by the task's/actor's unique identifier (function descriptor + language type) and stores it in the GCS, so that raylets have some idea of the peak/p95/avg memory usage for that task/actor instance in the given session. For even better scheduling, consider storing binned histograms rather than summary statistics like peak/p95/avg usage, or even binned time-series histograms (two dimensions of bins for each resource, so that Ray also takes into account the task/actor profile over the course of its lifetime).
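To make the summary statistics concrete, here is a minimal sketch of what a raylet could keep per task type, assuming memory samples are keyed by a (function descriptor, language) pair. `TaskKey`, `MemoryProfile`, `record_sample` and `estimate` are hypothetical names for illustration and are not Ray APIs; the p95 uses the nearest-rank method, which is only one of several possible definitions.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical key: (function descriptor, language type).
TaskKey = Tuple[str, str]


@dataclass
class MemoryProfile:
    """Memory samples (in bytes) observed for one task type in this session."""
    samples: List[int] = field(default_factory=list)

    def record(self, used_bytes: int) -> None:
        self.samples.append(used_bytes)

    def peak(self) -> int:
        return max(self.samples)

    def avg(self) -> float:
        return sum(self.samples) / len(self.samples)

    def p95(self) -> int:
        # Nearest-rank percentile: rank = ceil(0.95 * n), using integer math.
        ordered = sorted(self.samples)
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]


# Local aggregate that a raylet could periodically sync to the GCS.
profiles: Dict[TaskKey, MemoryProfile] = defaultdict(MemoryProfile)


def record_sample(key: TaskKey, used_bytes: int) -> None:
    profiles[key].record(used_bytes)


def estimate(key: TaskKey, strategy: str) -> float:
    """Memory to reserve when placing a task: pack by peak or by average."""
    profile = profiles[key]
    if strategy == "conservative":
        return profile.peak()
    if strategy == "aggressive":
        return profile.avg()
    raise ValueError(f"unknown strategy: {strategy}")
```

A binned histogram would replace the raw `samples` list with fixed-width counters, which bounds memory use at the cost of percentile precision.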
Raylets can accumulate/aggregate locally-collected statistics over intervals, and periodically sync with the GCS to avoid inundating the GCS with network load. The scheduling can then be chosen to be either conservative (pack by peak usage) or aggressive (pack by avg usage), using some statistics to figure out reasonable thresholds for the combined memory usage per task type (or "task_group", a label for specific invocations, or groups of invocations, of that task). This avoids having to rely on preemption except in statistical outlier scenarios, or for coarse-grained tasks which do not obey the law of large numbers.
Conclusion: profiling-based placement is especially useful for actors and long-running tasks with highly-variable memory usage.
Rationale: short-lived tasks are better dealt with via memory backpressure. The same could be true of tasks & actors with stable memory usage, but perhaps to account for startup time, knowing the peak memory usage also helps with scheduling and with not accidentally causing OOM/preemption.
APIs for Specifying Profile-Guided Behaviour
ray.init(
    config={
        # ...
        "profile_guided_scheduling": {
            "memory": "conservative=p95",  # or "aggressive" / "histogram"
            "cpu": "conservative=peak",  # or "aggressive" / "histogram"
            "custom_resource": ...,
        },
    }
)
Configuration on the task/actor level:
'''
ray will make a best effort to place this actor/task based on the
actor/task's resource usage profile and its knowledge of the resource usage on
each node
'''
@ray.remote(profile_guided="memory,cpu")
class MyActor:
    ...
'''
ray will not collect statistics for resource usage for this task. It will not consider
task-specific profiling data when deciding how to schedule this task
It can still rely on preemption and memory back-pressure to choose
how to schedule this task.
'''
@ray.remote(profile_guided="none") # this is the default value
def my_task():
    ...
I think this is very similar to cardinality estimation in databases: you should tune your plans to the current workload. We could persist
Profiling as an alternative to Placement Groups
I think this would be extremely useful for profiling-based placement of actor creation tasks (unless actors spin up processes with memory that lives outside of the worker's heap, but one could do some magic with profiling all of a worker's child processes).
Relative Importance of Type of Profiling
In my mind, memory profiling-based scheduling is more useful than CPU profiling, since scheduling poorly for the latter merely results in contention on a stateless resource (CPU), and at most results in some small increase in task latency, whereas memory contention can result in OOM or spilling to swap, both of which one would want to avoid at all costs. Likewise, fractional GPU-usage scheduling/placement and profiling is more dependent on GPU-memory consumption than on CU utilization.
Related: Preemptible Actors, Serializable Actors and Actor Checkpoint Recovery
Relatedly, consider my idea on
One can also rely on the notion of a preemptible Actor, if there is a safe way to "interrupt" the actor's ability to process new tasks (the actor has status "suspended"), so that tasks scheduled for that actor will not be scheduled until it once again has status "active". Here is the flow of events for preempting an actor:
An objective in pursuing preemptible actors is to make sure that this sequence of events can happen very rapidly. In staleness-tolerant use-cases, this can be aided by leveraging stale actor checkpoints (described more below).
Serializable actors and fault-tolerance
Serializable actors could also fit into the fault-tolerance picture. Instead of object lineage, Ray can have a native notion of checkpointed serializable actors. The actor occasionally broadcasts its checkpoint to the object store on nodes with available memory resources. When a node with an actor crashes, Ray can recover the actor by choosing one node to recover the actor from the checkpoint. This uses Ray's inherently distributed object store as an alternative to persistent storage for checkpointing, and could result in much faster actor recovery as one does not need to read from disk.
This might also kill two birds with one stone: if we are OK with relying on slightly stale data, a preempted Actor might not have to directly transport its checkpoint at time of preemption, relying instead on a stale checkpoint on a remote node.
Additional Ideas: Ownership for Actors (incomplete thoughts)
Just like we have ownership for objects, one can perhaps reduce the GCS burden for spinning up new actors by letting other actors/tasks own a child actor.
Putting It All Together
Here is the API for specifying an actor's:
Extended Ray Actor API
# class-level preemptibility,
# and orthogonally,
# profile guided placement for this actor;
#
# by definition an actor remote method must be scheduled
# wherever the actor is and has no notion of independent
# profile-guided scheduling
#
# An actor's resource-usage profile inherits from its task resource usage
# profiles, the GCS could either group by profiling data by the task across
# the entire actor class, or the task specific to an actor instance.
@ray.remote(
preemptible="1", # try to preempt other tasks/actors first
# Profile guided placement:
# try to place actor based on session profile data;
# can talk to autoscaler to determine if should spin up a new node
# for this actor to be scheduled on
# (or rescheduled on, e.g. after preemption/recovery)
#
# if a placement group is specified, using profile_guided placement
# will override the placement strategy for that resource
# (think more about this..., esp. for custom resources...)
#
# if no `ray.session_profile` is provided, it will start with
# "peak" scheduling strategy being that actor's placement group
# and then start rescheduling based on profiling
# after a warm-up period/when a raylet faces resource pressure,
# or conversely, resource under-utilization
profile_guided="memory,cpu",
# whether or not to collect task profiling data,
# to be added to this class's/class instance's total resource usage
task_profiling_group_by="class,instance",
# alternately, use decorator as below.
# A preemptible actor class must implement ray.(de)serialize_actor
#
# Or the actor must be serializable by default serializer (e.g. pickle5)
# If not, either it will not be preemptible, or Ray will throw a runtime panic.
# (compile-time error for compiled languages, if possible)
serialize_actor="serialize",
deserialize_actor="deserialize",
# likewise, checkpointing requires `(de)serialize_actor` to be defined
checkpoint_recovery={
    # Number of nodes to broadcast to
    "num_nodes": {
        # fraction of total number of peer nodes, with optional minimum;
        # behaviour is either to panic if number of nodes in Ray cluster is < min,
        # or to default to max in that scenario.
        "fractional": "0.5,min=3",  # or "max"
        "explicit": 3,  # explicit number
    },
    # strategy for choosing which nodes to broadcast to:
    # "fixed", "resource-aware" or "random"; default is resource-aware
    "node_selection": "resource-aware",
    # whether to broadcast data to all nodes at once ("all"),
    # or in a round-robin (one node at a time) at every
    # `frequency` time-interval ("round_robin")
    "broadcast_strategy": "round_robin",
    "frequency": "5s",
    "retry": 3,
    "panic_on_missing_checkpoint": False,
    "preempt_from_stale_checkpoint": True,  # or "timeout=10s"; default is False
},
)
class Actor:
    @ray.serialize_actor
    def serialize(self):
        return Serializer.serialize(self)
    @ray.deserialize_actor
    def deserialize(obj: RayObject) -> Self:
        return Serializer.deserialize(obj)
# instance-level override for preemptibility of actor instance
actor = Actor.remote(preemptible="0")
actor = ray.remote(Actor, preemptible="0")
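The checkpoint-and-recover idea behind `serialize_actor`/`deserialize_actor` can be sketched without Ray. This is only an illustration under assumptions: a plain dict stands in for the object stores on peer nodes, and `Counter`, `node_stores`, `broadcast_checkpoint` and `recover` are hypothetical names. It also shows the staleness trade-off: state changed after the last broadcast is lost on recovery.

```python
import pickle
from typing import Dict, List, Optional


class Counter:
    """Stand-in for an actor with explicit (de)serialization hooks."""

    def __init__(self) -> None:
        self.value = 0

    def incr(self) -> int:
        self.value += 1
        return self.value

    def serialize(self) -> bytes:
        # Only plain state is pickled, not the object itself.
        return pickle.dumps({"value": self.value})

    @classmethod
    def deserialize(cls, blob: bytes) -> "Counter":
        actor = cls()
        actor.value = pickle.loads(blob)["value"]
        return actor


# Stand-in for per-node object stores: node name -> latest checkpoint.
node_stores: Dict[str, bytes] = {}


def broadcast_checkpoint(actor: Counter, nodes: List[str]) -> None:
    """Write the same checkpoint to every chosen node (the "all" strategy)."""
    blob = actor.serialize()
    for node in nodes:
        node_stores[node] = blob


def recover(live_nodes: List[str]) -> Optional[Counter]:
    """Rebuild the actor from the first live node that holds a checkpoint."""
    for node in live_nodes:
        blob = node_stores.get(node)
        if blob is not None:
            return Counter.deserialize(blob)
    return None
```

Returning `None` when no checkpoint survives corresponds to `panic_on_missing_checkpoint` being false; a stricter variant would raise instead.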
No but I think it is under discussion: #17596 |
@ericl I can split up the above ideas into RFCs if it is of interest. I guess I have a lot of thoughts here. I also don't know how the ideas contained within connect with existing initiatives/RFCs, or existing thinking on how to split responsibility for functionality between ray core (language/application-independent) and its dependent libraries (language/application-specific - violation of DRY, or necessary specialization?). Also happy to take the discussion offline as I understand it's very dense. |
To summarize, as a first step to memory-aware scheduling, I am for preemption as a task-level opt-in behaviour, and possibly memory back-pressure as a default behaviour, with configurable threshold/polling interval/window and intelligent defaults, e.g.
# default values
ray.init(config={"scheduler_memory_backpressure": {"threshold": 0.8, "interval": "100ms"}})  # other config elided
We can provide a guide for users to calculate a threshold that is suitable for their nodes'
The calculation might change if we also consider GPU bandwidth (for pinned memory, i.e. DMA) or RDMA network ingress, which might also dump a lot of data into the node's memory independently of the node's CPU bandwidth. However, these (esp. the latter) might be edge cases. Some additional consideration also needs to be given to the time taken to clean up preempted tasks’ memory usage. |
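One possible way to derive such a threshold is sketched below. It rests on a pessimistic assumption that is mine, not the thread's: between two polls, memory usage can grow at most at the node's full memory bandwidth, so the headroom to reserve is bandwidth times polling interval. `worst_case_growth_bytes` and `suggest_threshold` are hypothetical helper names.

```python
GIB = 1024 ** 3


def worst_case_growth_bytes(bandwidth_bytes_per_s: float, interval_s: float) -> float:
    """Upper bound on how much memory usage can grow between two polls."""
    return bandwidth_bytes_per_s * interval_s


def suggest_threshold(
    total_bytes: float, bandwidth_bytes_per_s: float, interval_s: float
) -> float:
    """Fraction of node memory at which backpressure should kick in.

    Reserves enough headroom to absorb one polling interval of
    worst-case growth; clamps to 0.0 if the node is too small for that.
    """
    headroom = worst_case_growth_bytes(bandwidth_bytes_per_s, interval_s)
    return max(0.0, 1.0 - headroom / total_bytes)


if __name__ == "__main__":
    # 256 GiB node, ~100 GiB/s memory bandwidth, 100 ms polling interval.
    print(suggest_threshold(256 * GIB, 100 * GIB, 0.1))
```

For a 256 GiB node at ~100 GiB/s with a 100ms interval this reserves about 10 GiB, giving a threshold of roughly 0.96. On a 16 GiB node the same headroom would push the threshold down to 0.375, which suggests small nodes need a shorter polling interval instead.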
Old (largely wrong) thoughts on the difficulty of implementing preemption. The notion of preemption itself is quite contentious even for tasks. For instance, in the C++ API (and as was originally planned for the Rust API), tasks run in the same process/thread as the worker (EDIT: this is false, see above). I don’t know what available language features exist for Python and Java to handle preemption gracefully, but I suspect the scenario might be similar. Terminating the worker process itself is a possibility, and existing mechanisms for recovering from worker failures can kick in… since preemption, to my mind, ought to be used as a last resort and can be seen as an emergency manoeuvre to prevent OOMing the node, this does seem like a reasonable option, though not without its overhead. An So does preempting an actor. Firstly, if we are forcefully terminating the worker process, a task that is running midway through for an actor might not leave the actor in a consistent state to be persisted or serialised. Forcefully killing a thread is an even worse idea as the OS might not clean up dangling file descriptors, memory allocations etc. If we allow tasks to run to completion and stop new tasks from being scheduled, then serialization and all that good stuff can occur, but we don’t know if tasks will complete in some reasonable time or if at all (e.g. for loops?). The most ideal scenario for preemption that does not involve terminating the worker process itself is when one has an async executor/event loop running in the task or actor process context. It may be possible to implement an So to conclude, more thought should be given to how preemption is handled. |
I think that for preemption which can trigger on_preemption handlers, and which does not require you to kill your entire worker process, you need to handle breakpoints in your application code loop, say, based on a channel. So we'd need an additional API for exposing this channel, I believe. The
I'm looking into async event loops to understand how they do similar things, like forcing a yield, and also whether the current DirectTaskReceiver has a way of at least terminating and cleaning up the user function. Presumably, this would be easier for single Java processes, since they live in the JVM, but how about the scenario where there are many worker threads in the same process which reference the same heap? All of this seems like pretty high programming overhead.
@Hoeze do you have any thoughts on needing to insert breakpoints in your code, for tasks you know might be problematic memory-wise? For instance, ray.yield():
def handle_yield(data):
    with ray.yield():
        save(data)
@ray.remote
def my_task():
    do_something()
    ray.yield()
    data = process_data()
    handle_yield(data)
    data2 = process_data_more(data)
The above idea of having preemption priorities does not change in light of this, as which task is preempted is not specified within the task definition, only the possibility of yielding. This method creates a response-time issue, however: a cooperatively scheduled task as above may not yield for arbitrary amounts of time.
Also, what is the point of handling yield? Can a worker then restart the task from the checkpointed data? How should one express this? Maybe:
def my_task():
    data = process_data()
    # puts to object store if yielding, which becomes additional input to "interrupted_task"...
    # Task entrypoint also has a go-to built in to jump to the right breakpoint based on the interrupted_task metadata
    data = ray.yield_recoverable(data)
    data2, data3 = process_data_more(data)
    data2, data3 = ray.yield_recoverable(data2, data3)
But maybe this doesn't make sense, as you may be better off with tasks of finer granularity. At the very least, a preemption handler could handle cleanup for certain things like killing child processes and closing file handles...
Personally I think yield with cleanup is a good idea, but yield with checkpoint is not. |
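A rough sketch of the "yield with cleanup" variant, written without Ray so it can be run as-is. Since `yield` is a reserved word in Python, `ray.yield()` could not be spelled that way; the sketch uses a hypothetical `maybe_yield` instead. `PreemptionChannel` and `PreemptionRequested` are also illustrative names, with a `threading.Event` standing in for the channel the scheduler would expose.

```python
import threading
from typing import Callable, List


class PreemptionRequested(Exception):
    """Raised at a breakpoint once the scheduler has asked the task to stop."""


class PreemptionChannel:
    """Hypothetical channel between the scheduler and a running task."""

    def __init__(self) -> None:
        self._flag = threading.Event()

    def request(self) -> None:
        # Called by the scheduler side, possibly from another thread.
        self._flag.set()

    def maybe_yield(self, cleanup: Callable[[], None]) -> None:
        # Breakpoint: a no-op unless preemption was requested.
        if self._flag.is_set():
            cleanup()
            raise PreemptionRequested()


def my_task(channel: PreemptionChannel, log: List[str]) -> str:
    log.append("step 1")
    channel.maybe_yield(lambda: log.append("cleanup"))
    log.append("step 2")
    return "done"
```

The response-time concern is visible here: nothing happens between breakpoints, so a task that never reaches `maybe_yield` is never preempted cooperatively and would still need the forced path.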
This suggests one should have the following config for preemption:
# cooperative means you rely on application to yield
# forced means you kill the worker process
@ray.remote(preemption={"level": "always", "type": "cooperative"})  # or "forced"
def my_func():
    ...
|
@Hoeze @jon-chuang good thoughts. Convert to a google doc? Indeed it's difficult to have a complex discussion in GitHub format. @stephanie-wang also has an initial design doc here: https://docs.google.com/document/d/1AG1Nx2znKZLsld92V1hvGIcsY-iSIzeuzhbs74w-70g/edit#heading=h.17dss3b9evbj
This is very similar to what @mwtian has been exploring as "distributed co-routines". It turns out https://github.com/llllllllll/cloudpickle-generators can pickle async functions in the middle of execution since those can be converted to generators. Hence, we can achieve something similar. That said, I'm not sure that level of preemption support is needed. For most workloads, just a |
@jon-chuang Is it possible that the task can receive SIGTERM?
It's very easy to handle SIGTERM in Python for doing pre-kill tasks. One could think about giving the client the possibility to skip the grace period (the time before the task gets killed) by sending some "SIGTERM received; I'm preempting now..." message to the cluster. |
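For reference, a minimal sketch of that pattern using only the standard library `signal` module. Whether Ray would deliver SIGTERM to a task before killing it is exactly the open question here, so this only shows the task side; `TaskPreempted`, `on_sigterm`, `install_handler` and `run_task` are hypothetical names.

```python
import signal
from typing import Callable, List, Optional

cleanup_log: List[str] = []


class TaskPreempted(Exception):
    """Raised inside the task when SIGTERM arrives."""


def on_sigterm(signum, frame) -> None:
    # Turn the signal into an exception so normal try/except cleanup runs.
    cleanup_log.append(f"received signal {int(signum)}")
    raise TaskPreempted()


def install_handler() -> None:
    # signal.signal may only be called from the main thread.
    signal.signal(signal.SIGTERM, on_sigterm)


def run_task(work: Callable[[], object]) -> Optional[object]:
    """Run `work`, doing pre-kill cleanup if it gets preempted."""
    try:
        return work()
    except TaskPreempted:
        cleanup_log.append("closed files, killed child processes")
        return None
```

Calling `install_handler()` at worker start-up and wrapping the user function in `run_task` gives the grace-period behaviour; the cleanup has to finish before the cluster follows up with a hard kill.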
Overview
Currently, the Ray scheduler only schedules based on CPUs by default for tasks (e.g., num_cpus=1). The user can also request memory (e.g., memory=1e9); however, in most applications it is quite difficult to predict the heap memory usage of a task. In practice, this means that Ray users often see OOMs due to memory over-subscription, and resort to hacks like increasing the number of CPUs allocated to tasks.
Ideally, Ray would manage this automatically: when tasks consume too much heap memory, the scheduler should push back on the scheduling of new tasks and preempt eligible tasks to reduce memory pressure.
Proposed design
Allow Ray to preempt and kill tasks that are using too much heap memory. We can do this by scanning the memory usage of tasks e.g., every 100ms, and preempting tasks if we are nearing a memory limit threshold (e.g., 80%).
Furthermore, the scheduler can stop scheduling new tasks should we near the threshold.
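As a rough illustration of one polling step, the sketch below decides whether to keep admitting tasks and which tasks to preempt. It is not the Ray implementation: `TaskInfo` and `plan` are hypothetical names, and evicting the largest opted-in tasks first is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskInfo:
    task_id: str
    heap_bytes: int
    preemptible: bool  # opt-in; tasks are not preemptible by default


def plan(
    total_bytes: int,
    used_bytes: int,
    tasks: List[TaskInfo],
    threshold: float = 0.8,
) -> Tuple[bool, List[str]]:
    """One polling step: returns (admit_new_tasks, task_ids_to_preempt)."""
    limit = threshold * total_bytes
    if used_bytes < limit:
        return True, []

    # Over the threshold: stop admitting new tasks and preempt opted-in
    # tasks, largest first, until projected usage drops below the limit.
    victims: List[str] = []
    remaining = used_bytes
    candidates = sorted(
        (t for t in tasks if t.preemptible),
        key=lambda t: t.heap_bytes,
        reverse=True,
    )
    for task in candidates:
        if remaining < limit:
            break
        victims.append(task.task_id)
        remaining -= task.heap_bytes
    return False, victims
```

If no opted-in task exists, the step only stops admission and returns an empty victim list, so memory pressure from non-preemptible tasks is left to the existing OOM behaviour.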
Compatibility: Preempting certain kinds of tasks can be unexpected, and breaks backwards compatibility. This can be an "opt-in" feature initially for tasks, e.g., `@ray.remote(memory="auto")`, in order to preserve backwards compatibility. Libraries like multiprocessing and Datasets can enable this by default for their map tasks. In the future, we can try to enable it by default for tasks that are safe to preempt (e.g., those that are not launching child tasks, and have retries enabled).
Acceptance criteria: As a user, I can run Ray tasks that use large amounts of memory without needing to tune/tweak Ray resource settings to avoid OOM crashes.