
Take into account queue length in autoscaling #5684

Merged
ericl merged 5 commits into ray-project:master on Sep 11, 2019

Conversation

@ericl (Contributor) commented Sep 11, 2019:

Why are these changes needed?

Related issue number

Checks

@ericl changed the title from "[WIP] Take into account queue length in autoscaling" to "Take into account queue length in autoscaling" on Sep 11, 2019
@@ -357,6 +360,7 @@ def run(self):
         try:
             self._run()
         except Exception:
+            logger.exception("Error in monitor loop")
@ericl (Contributor, Author):
If you don't do this, the error message is lost forever while the monitor tries to terminate nodes.

Contributor:

Does this mean nodes probably weren't cleaned up? If so, would be good to print that in the error message.

@ericl (Contributor, Author):

That's done below, actually.
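For context, a minimal sketch of the pattern this thread is describing; the Monitor class shape and the _teardown_cluster name are stand-ins for illustration, not Ray's actual API:

import logging

logger = logging.getLogger(__name__)


class Monitor:
    """Hypothetical stand-in for the autoscaler monitor process."""

    def _run(self):
        ...  # the actual monitoring loop would live here

    def _teardown_cluster(self):
        ...  # terminate worker nodes

    def run(self):
        try:
            self._run()
        except Exception:
            # Log the traceback immediately; if teardown below also fails,
            # the original error would otherwise be lost forever.
            logger.exception("Error in monitor loop")
        # Cleanup runs after logging (the "done below" part), so a failure
        # during teardown cannot swallow the root cause recorded above.
        self._teardown_cluster()

The change under discussion is only the logging line: because cleanup follows the except block, logging first guarantees the original traceback survives even if node termination then fails.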

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16964/

@AmplabJenkins
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16966/

@edoakes (Contributor) left a review:

LGTM

src/ray/raylet/scheduling_queue.cc: review comment (outdated, resolved)
@@ -142,45 +142,56 @@ def terminate_node(self, node_id):
 class LoadMetricsTest(unittest.TestCase):
     def testUpdate(self):
         lm = LoadMetrics()
-        lm.update("1.1.1.1", {"CPU": 2}, {"CPU": 1})
+        lm.update("1.1.1.1", {"CPU": 2}, {"CPU": 1}, {})
Contributor:

Add some test cases for the new metric

@ericl (Contributor, Author):

Yeah, there are a couple of entries below in testLoadMessages.
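For illustration, a minimal sketch of what the new trailing dict in update() could feed into; the attribute names and the aggregation in approx_cpu_workload are assumptions inferred from the test diff above, not Ray's actual implementation:

class LoadMetrics:
    """Hypothetical queue-aware load tracker; illustrative only."""

    def __init__(self):
        self.static_by_ip = {}   # total resources reported per node
        self.dynamic_by_ip = {}  # currently available resources per node
        self.load_by_ip = {}     # resource demand of tasks queued per node

    def update(self, ip, static_resources, dynamic_resources, resource_load):
        self.static_by_ip[ip] = static_resources
        self.dynamic_by_ip[ip] = dynamic_resources
        self.load_by_ip[ip] = resource_load

    def approx_cpu_workload(self):
        # Utilization alone misses demand waiting in the queue, so count
        # CPUs in use plus CPUs requested by queued tasks.
        used = sum(
            self.static_by_ip[ip].get("CPU", 0)
            - self.dynamic_by_ip[ip].get("CPU", 0)
            for ip in self.static_by_ip)
        queued = sum(
            load.get("CPU", 0) for load in self.load_by_ip.values())
        return used + queued

Under this sketch, lm.update("1.1.1.1", {"CPU": 2}, {"CPU": 1}, {"CPU": 4}) yields a workload of 5 CPUs (1 in use plus 4 queued), which is what lets the autoscaler scale up even when current utilization looks low.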


@@ -357,6 +360,7 @@ def run(self):
         try:
             self._run()
         except Exception:
+            logger.exception("Error in monitor loop")
Contributor:

Suggested change:
-            logger.exception("Error in monitor loop")
+            logger.exception("Error in monitor loop.")

@ericl (Contributor, Author):

I don't believe in periods in log messages

@ericl merged commit 2fdefe1 into ray-project:master on Sep 11, 2019
@ericl (Contributor, Author) commented Sep 11, 2019:

Merging so I can test this for real on compiled wheels -- had some issues trying to set it up before.

@AmplabJenkins
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16978/

@ericl (Contributor, Author) commented Sep 12, 2019:

I just tested this on a real cluster and it works as expected.
