-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Job Submission] Add stop API to http & sdk, with better status code + stacktrace #20094
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Don't think the changes to data_types
are necessary & not sure about the DELETE
method for stopping jobs though.
dashboard/modules/job/data_types.py
Outdated
@dataclass | ||
class GetPackageRequest: | ||
package_uri: str | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think the changes to this file are necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kinda overfitting to current approach to reuse request validation logic, but i agree we probably don't need to add request types for most of these for a long time ..
dashboard/modules/job/job_head.py
Outdated
|
||
return aiohttp.web.Response() | ||
return aiohttp.web.Response( | ||
text={"success": True}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we ever set success = False
? otherwise probably don't need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't really use it, status code is only thing we need
dashboard/modules/job/job_head.py
Outdated
status=aiohttp.web.HTTPOk.status_code, | ||
) | ||
|
||
@routes.delete(JOBS_API_ROUTE_STOP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is really a DELETE
because we don't clean up the job record. I don't know exactly what the idiomatic method to choose here would be though. @tchordia any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our "stop" api is fire and forget so hard to guarantee to be idempotent. POST
makes sense too.
ex_message = str(e.value) | ||
assert "400 Client Error" in ex_message | ||
assert "TypeError: __init__() got an unexpected keyword argument" in ex_message # noqa: E501 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use the match
kwarg to pytest.raises
to do this for you as well
entrypoint="echo hello", | ||
runtime_env={"working_dir": "s3://does_not_exist"}) | ||
ex_message = str(e.value) | ||
assert "500 Server Error" in ex_message |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a 400, not a 500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(500 means something went wrong, like an internal assertion failed or unexpected error occurred)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wrapped everything within job_manager.submit_job()
in 500 to wrap everything happens at job manager / ray level, for this one in particular it fails at validation.py
for not having a .zip suffix for s3 path, maybe we can add a few more cases and see. Probably not all of them are 500
if isinstance(result, aiohttp.web.Response): | ||
return result | ||
else: | ||
submit_request = result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little bit ugly that we have to duplicate this code everywhere. One alternative would be to do this with a context manager:
with exception_status_code(500):
# Do stuff, if exception is raised, we return response with 500 and traceback.
No need to do it in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh nice this is definitely better than what we have here. i found it a bit weird towards the end as well but going too far to bother to change it back lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I give that a try and i think current dashboard modules don't really respect exceptions thrown from our own context manager and we can only do it by returning a response but set exception code. So i will leave it as is for now :/
dashboard/modules/job/data_types.py
Outdated
class UploadPackageRequest: | ||
package_uri: str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't think this is used?
Diff Summary
After landing internal implementation of stopping jobs, we can now expose it to external user.
In addition, we have problems with current HTTP endpoint as it will NOT set correct return code or surface exception back to the caller or job sdk. Therefore in this diff, it also enhanced return code and exception propagation for both http and sdk.
According to https://docs.aiohttp.org/en/latest/web_exceptions.html , we now explicitly set
Test Plan
Added tests to:
So everything that went wrong before driver script runs are surfaced immediately via http / sdk; everything failed in driver script goes to dedicated log files and can be queried via job api.
Related issue number
Closes #19618 #19614
Checks
scripts/format.sh
to lint the changes in this PR.