-
Notifications
You must be signed in to change notification settings - Fork 28.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3398] [EC2] Have spark-ec2 intelligently wait for specific cluster states #2339
Conversation
Depending on what the reviewers think, there are some additional lines that can be removed, like:
|
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
@JoshRosen and @davies: This PR is ready for review. |
if all(i.state == 'running' for i in cluster_instances) and \ | ||
all(is_ssh_available(host=i.ip_address, opts=opts) for i in cluster_instances): | ||
print "" # so that next line of output starts on new line | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could put "break" here, and put 'sys.stdout.write("\n")' at the end of this funciton
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'll do that but put print ""
at the end instead of sys.stdout.write("\n")
, since we want to flush the output immediately. Does that sound good to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The long comment sounds boring than 'sys.stdout.write("\n")'. With 'print ""', you also need to flush manually. I think we do not need to flush here, because it has no visually changes until other loggings coming in.
Either is fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I'll get back to you on this. Definitely a minor point either way, but I'd like to get it right. Gotta run for now but will check on this later tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see your point. I've made the change per your recommendation.
This patch look good to me, just some minor comments. I think we should keep --wait option (do not break user) and deprecate it. The dead code ( wait_for_cluster and wait_for_instances) should be removed. @JoshRosen how do you think? |
Alright, I've updated things per all the feedback I've gotten with one minor exception, which I will revisit later tonight. |
Jenkins, retest this please. |
QA tests have started for PR 2339 at commit
|
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
Flume test failed. My most recent commit just removed a couple of comment lines. Jenkins, retest this please. |
Jenkins, retest this please. |
QA tests have started for PR 2339 at commit
|
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
@davies This PR is ready for another review. I believe I've covered all the feedback given so far. |
@nchammas LGTM, thanks! |
QA tests have finished for PR 2339 at commit
|
QA tests have started for PR 2339 at commit
|
Ah, OK, I just took care of that. An earlier version of that text would've put the line at over 100 characters, which is why it originally had those breaks in there. |
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
Test PASSed. |
QA tests have finished for PR 2339 at commit
|
Test PASSed. |
Hey @pwendell, is there anything else you'd like to change about this PR? |
BTW @shivaram I took your suggestion of checking the status checks before checking SSH itself. |
8b701d1
to
43a69f0
Compare
Test FAILed. |
Dunno where this most recent failure came from. The build output just shows a failure to checkout the patch. |
Jenkins, retest this please. |
QA tests have started for PR 2339 at commit
|
QA tests have finished for PR 2339 at commit
|
Test PASSed. |
FYI: I believe I have these phantom new class notes finally sorted out in #2606. |
@pwendell Is this good to merge in? |
BTW, related side note: I just used Packer to create an example image from a template, and it looks like Packer follows a similar pattern of first waiting for the instance to come up, and then waiting for SSH to become available.
Just thought that was cool. |
This looks good to me, so I'm going to merge it. Thanks for doing this! |
This PR re-introduces [0e648bc](0e648bc) from PR #2339, which somehow never made it into the codebase. Additionally, it removes a now-unnecessary linear backoff on the SSH checks since we are blocking on EC2 status checks before testing SSH. Author: Nicholas Chammas <[email protected]> Closes #3195 from nchammas/remove-ec2-ssh-backoff and squashes the following commits: efb29e1 [Nicholas Chammas] Revert "Remove linear backoff." ef3ca99 [Nicholas Chammas] reuse conn adb4eaa [Nicholas Chammas] Remove linear backoff. 55caa24 [Nicholas Chammas] Check EC2 status checks before SSH.
Instead of waiting arbitrary amounts of time for the cluster to reach a specific state, this patch lets
spark-ec2
explicitly wait for a cluster to reach a desired state.This is useful in a couple of situations:
This patch removes the need for the
--wait
option and removes some of the time-based retry logic that was being used.