[SPARK-12411] [CORE] Decrease executor heartbeat timeout to match heartbeat interval #10365

Closed
wants to merge 1 commit into from

nongli
Contributor

@nongli nongli commented Dec 17, 2015

Previously, the heartbeat RPC timeout was the default network timeout, which is the same value
the driver uses to determine dead executors. This means that if there is a network issue,
the executor is declared dead after a single heartbeat attempt. There is a separate config
for the heartbeat interval, which is a better value to use for the heartbeat RPC timeout. With
this change, the executor makes multiple heartbeat attempts even when there are RPC issues.
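
To make the arithmetic behind this description concrete, here is a minimal sketch in Python (not Spark code). It assumes a simplified model: a single heartbeat thread, each failing attempt blocking for the full RPC timeout, and no RPC-level retries. The function name and parameters are hypothetical, and the 120 s and 10 s values are assumed defaults for the network timeout and heartbeat interval, chosen because they match the "12" mentioned later in this conversation.

```python
def heartbeat_attempts_before_declared_dead(rpc_timeout_s, heartbeat_interval_s, driver_timeout_s):
    """Count heartbeat attempts that can time out before the driver gives up.

    Simplified model (an assumption, not Spark's actual scheduler): one
    heartbeat thread, and during a network outage every attempt blocks for
    the full RPC timeout. A new attempt cannot start sooner than the
    heartbeat interval, so each attempt occupies the larger of the two.
    """
    time_per_attempt_s = max(rpc_timeout_s, heartbeat_interval_s)
    return driver_timeout_s // time_per_attempt_s


# Before the patch: RPC timeout equals the driver's dead-executor timeout.
before = heartbeat_attempts_before_declared_dead(
    rpc_timeout_s=120, heartbeat_interval_s=10, driver_timeout_s=120)

# After the patch: RPC timeout equals the heartbeat interval.
after = heartbeat_attempts_before_declared_dead(
    rpc_timeout_s=10, heartbeat_interval_s=10, driver_timeout_s=120)

print(before, after)
```

Under these assumptions the executor gets one attempt before the patch and twelve after it, which is the difference the description is pointing at.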

…rtbeat interval.

@andrewor14
Contributor

Is the issue you're trying to fix the following: there can be up to 12 parallel heartbeats to the driver from each executor? Other than that, is there any other change in behavior that you're expecting as of this patch?

@SparkQA

SparkQA commented Dec 18, 2015

Test build #47956 has finished for PR 10365 at commit c899481.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nongli
Contributor Author

nongli commented Dec 18, 2015

I think there can only be 1 heartbeat in flight, but because the heartbeat RPC uses the network timeout, the executor only gets 1 try instead of 12.

@zsxwing
Member

zsxwing commented Dec 18, 2015

there can be up to 12 parallel heartbeats to the driver from each executor?

12 parallel heartbeats? Why are there so many heartbeats in flight?

@andrewor14
Contributor

12 parallel heartbeats? Why are there so many heartbeats in flight?

There can't be because we use a single thread for the heartbeat; it was my misunderstanding.
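
The single-thread point can be illustrated with a small Python sketch (again, not Spark code). It assumes heartbeats are submitted to a one-worker pool and that each send blocks until it completes; `max_heartbeats_in_flight` and its parameters are hypothetical names used only for this illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def max_heartbeats_in_flight(num_heartbeats=5, rpc_duration_s=0.01):
    """Return the peak number of simultaneously running heartbeat sends."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def send_heartbeat():
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(rpc_duration_s)  # stand-in for the blocking heartbeat RPC
        with lock:
            in_flight -= 1

    # One worker thread: queued heartbeats run strictly one after another.
    with ThreadPoolExecutor(max_workers=1) as pool:
        futures = [pool.submit(send_heartbeat) for _ in range(num_heartbeats)]
        for future in futures:
            future.result()
    return peak


print(max_heartbeats_in_flight())
```

With a single worker, a slow heartbeat delays the next one rather than overlapping with it, so the peak stays at one regardless of how many heartbeats are queued.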

@andrewor14
Contributor

LGTM, merging into master.

@asfgit asfgit closed this in 0514e8d Dec 19, 2015
@nongli nongli deleted the spark-12411 branch December 21, 2015 20:20
asfgit pushed a commit that referenced this pull request Dec 24, 2015
…tbeat interval

Author: Nong Li <[email protected]>

Closes #10365 from nongli/spark-12411.