Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Job Submission] Add stop API to http & sdk, with better status code + stacktrace #20094

Merged
merged 7 commits into from
Nov 6, 2021

Conversation

jiaodong
Copy link
Member

@jiaodong jiaodong commented Nov 5, 2021

Diff Summary

After landing internal implementation of stopping jobs, we can now expose it to external user.

In addition, we have problems with current HTTP endpoint as it will NOT set correct return code or surface exception back to the caller or job sdk. Therefore in this diff, it also enhanced return code and exception propagation for both http and sdk.

According to https://docs.aiohttp.org/en/latest/web_exceptions.html , we now explicitly set

  • 200 - HTTPOk (Everything goes through)
  • 400 - HTTPBadRequest (Request validation and conversion failed)
  • 500 - HTTPInternalServerError (Ray internal execution failed)

Test Plan

Added tests to:

  • Test e2e stop flow
  • Bad request status code + exception message validation
  • stderr via logs if failure happened in driver

So everything that went wrong before driver script runs are surfaced immediately via http / sdk; everything failed in driver script goes to dedicated log files and can be queried via job api.

Related issue number

Closes #19618 #19614

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Don't think the changes to data_types are necessary & not sure about the DELETE method for stopping jobs though.

Comment on lines 20 to 24
@dataclass
class GetPackageRequest:
package_uri: str


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think the changes to this file are necessary

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is kinda overfitting to current approach to reuse request validation logic, but i agree we probably don't need to add request types for most of these for a long time ..


return aiohttp.web.Response()
return aiohttp.web.Response(
text={"success": True},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we ever set success = False? otherwise probably don't need this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't really use it, status code is only thing we need

status=aiohttp.web.HTTPOk.status_code,
)

@routes.delete(JOBS_API_ROUTE_STOP)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is really a DELETE because we don't clean up the job record. I don't know exactly what the idiomatic method to choose here would be though. @tchordia any thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our "stop" api is fire and forget so hard to guarantee to be idempotent. POST makes sense too.

Comment on lines +127 to +129
ex_message = str(e.value)
assert "400 Client Error" in ex_message
assert "TypeError: __init__() got an unexpected keyword argument" in ex_message # noqa: E501
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use the match kwarg to pytest.raises to do this for you as well

entrypoint="echo hello",
runtime_env={"working_dir": "s3://does_not_exist"})
ex_message = str(e.value)
assert "500 Server Error" in ex_message
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be a 400, not a 500

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(500 means something went wrong, like an internal assertion failed or unexpected error occurred)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wrapped everything within job_manager.submit_job() in 500 to wrap everything happens at job manager / ray level, for this one in particular it fails at validation.py for not having a .zip suffix for s3 path, maybe we can add a few more cases and see. Probably not all of them are 500

Comment on lines +104 to +107
if isinstance(result, aiohttp.web.Response):
return result
else:
submit_request = result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a little bit ugly that we have to duplicate this code everywhere. One alternative would be to do this with a context manager:

with exception_status_code(500):
    # Do stuff, if exception is raised, we return response with 500 and traceback.

No need to do it in this PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh nice this is definitely better than what we have here. i found it a bit weird towards the end as well but going too far to bother to change it back lol

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I give that a try and i think current dashboard modules don't really respect exceptions thrown from our own context manager and we can only do it by returning a response but set exception code. So i will leave it as is for now :/

Comment on lines 29 to 30
class UploadPackageRequest:
package_uri: str
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think this is used?

@edoakes edoakes merged commit 9ef75b2 into ray-project:master Nov 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add python api for job submission
3 participants