-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add job table to state API #5076
Conversation
Test FAILed. |
Test PASSed. |
Test FAILed. |
Args: | ||
job_id: A job ID to get information about. | ||
Returns: | ||
A dictionary with information about the job ID in question. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to define a JobInfo
class instead of using a dictionary for this. It's only a few fields and no logic right now, but this will likely grow in scope over time. Would also improve documentation over just having the fields in a header comment for another function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I agree! Let's do this change in a followup PR, since all the other tables (tasks, objects, clients) are dictionaries in the same format at the moment and we should be consistent.
Other possible choices we could consider are something json compatible with json-schema (would make it easy to put this API behind a REST endpoint in the future), protobuf, namedtuple (meh), dataclass (meh bc might require new python), any thoughts? cc @simon-mo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few considerations:
- there's a backport for dataclass
pip install dataclasses
so that won't be an issue. - in the api schema world, openapi schema is fairly big as well. (Kubernetes use it)
- protobuf in python is hard to work with. Non-intuitive API, slow serialization
- dataclass > namedtuple
python/ray/state.py
Outdated
Returns: | ||
A dictionary with information about the job ID in question. | ||
""" | ||
# Allow the argument to be either a DriverID or a hex string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean JobID
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually fixed in the comment above, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the fix got rolled back when I force pushed
python/ray/state.py
Outdated
"""Fetch and parse the job table information for a single job ID. | ||
|
||
Args: | ||
job_id: A job ID to get information about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should note that it can be a JobID
or hex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not actually fixed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it got rolled back
python/ray/state.py
Outdated
def job_table(self): | ||
"""Fetch and parse the Redis job table. | ||
Returns: | ||
Information about the Ray jobs in the cluster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Information in what format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
Test FAILed. |
""" | ||
# Allow the argument to be either a DriverID or a hex string. | ||
if not isinstance(job_id, ray.JobID): | ||
job_id = ray.JobID(hex_to_binary(job_id)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also assert the type of job_id
is str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -227,6 +227,12 @@ message JobTableData { | |||
bytes job_id = 1; | |||
// Whether it's dead. | |||
bool is_dead = 2; | |||
// The UNIX timestamp corresponding to this event (job added or removed). | |||
int64 timestamp = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to declare 2 fields start_timestamp
and stop_timestamp
explicitly instead of 1 timestamp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each entry in the JobTable is a log entry (either job addition, or job removal, if is_dead = true), so a single timestamp is indeed the more natural representation here. They get aggregated into start time and stop time in the client API.
// The UNIX timestamp corresponding to this event (job added or removed). | ||
int64 timestamp = 3; | ||
// IP of the node this job was started on. | ||
string node_manager_address = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just node address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is to be consistent with the ClientTable
Test PASSed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
src/ray/raylet/node_manager.cc
Outdated
worker_pool_.RegisterDriver(std::move(worker)); | ||
local_queues_.AddDriverTaskId(driver_task_id); | ||
RAY_CHECK_OK( | ||
gcs_client_->job_table().AppendJobData(JobID(driver_id), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment, these definitions are pretty inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will get nicer once #5110 is merged.
Test FAILed. |
python/ray/state.py
Outdated
Returns: | ||
A dictionary with information about the job ID in question. | ||
""" | ||
# Allow the argument to be either a DriverID or a hex string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't actually fixed in the comment above, right?
python/ray/state.py
Outdated
"""Fetch and parse the job table information for a single job ID. | ||
|
||
Args: | ||
job_id: A job ID to get information about. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not actually fixed, right?
python/ray/state.py
Outdated
|
||
def job_table(self): | ||
"""Fetch and parse the Redis job table. | ||
Returns: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indentation is wrong and missing newline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
python/ray/state.py
Outdated
Returns: | ||
Information about the Ray jobs in the cluster, | ||
namely a list of dicts with keys: | ||
- "JobID" (sha1 identifier for the job), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "JobID" (sha1 identifier for the job), | |
- "JobID" (identifier for the job), |
It's just random bytes, and will probably be shortened soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -2432,6 +2432,9 @@ def test_global_state_api(shutdown_only): | |||
with pytest.raises(Exception): | |||
ray.nodes() | |||
|
|||
with pytest.raises(Exception): | |||
ray.jobs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be changed to make sure it actually raises the error message we expect.
with pytest.raises(Exception, match="The ray global state API cannot be used before ray.init has been called."):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@@ -475,10 +475,14 @@ std::string ProfileTable::DebugString() const { | |||
return Log<UniqueID, ProfileTableData>::DebugString(); | |||
} | |||
|
|||
Status JobTable::AppendJobData(const JobID &job_id, bool is_dead) { | |||
Status JobTable::AppendJobData(const JobID &job_id, bool is_dead, int64_t timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just compute the timestamp
inside this method instead of passing it in, but I don't really have a preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer to have it outside, that makes it possible for somebody to use this function to use this function with a different timestamp than the current time.
Co-Authored-By: Robert Nishihara <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me assuming tests pass.
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
What do these changes do?
This exposes a list of jobs that are running or ran the Ray cluster to the global state API.
Related issue number
Linter
scripts/format.sh
to lint the changes in this PR.