[SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as spark property #22289

Status: Closed (wants to merge 11 commits)
@@ -286,9 +286,11 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S

     if (master.startsWith("yarn")) {
       val hasHadoopEnv = env.contains("HADOOP_CONF_DIR") || env.contains("YARN_CONF_DIR")
-      if (!hasHadoopEnv && !Utils.isTesting) {
+      val hasHadoopProp = sparkProperties.contains("spark.yarn.conf.dir")
+      if (!hasHadoopEnv && !hasHadoopProp && !Utils.isTesting) {
         error(s"When running with master '$master' " +
-          "either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.")
+          "either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment, " +
+          "or spark.yarn.conf.dir in the spark properties.")
       }
     }
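For illustration, a minimal launch sketch (not part of this PR) showing the new check in action; the jar path, main class, and conf dir are placeholder values:

import org.apache.spark.launcher.SparkLauncher

// With spark.yarn.conf.dir supplied as a Spark property, submission validates
// even though neither HADOOP_CONF_DIR nor YARN_CONF_DIR is exported.
val handle = new SparkLauncher()
  .setMaster("yarn")
  .setAppResource("/path/to/app.jar")                  // placeholder jar path
  .setMainClass("com.example.Main")                    // placeholder main class
  .setConf("spark.yarn.conf.dir", "/etc/hadoop/conf")  // assumed conf dir
  .startApplication()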
@@ -200,6 +200,7 @@ List<String> buildClassPath(String appClassPath) throws IOException {

     addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
     addToClassPath(cp, getenv("YARN_CONF_DIR"));
+    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
Contributor @jerryshao:

I'm wondering how we would update the classpath to switch to different Hadoop confs with InProcessLauncher? It seems the classpath here is not changeable after the JVM is launched.
Contributor @ifilonenko (Aug 31, 2018):

On another note, is this meant to extend to other resource managers? Kubernetes assumes only the HADOOP_CONF_DIR env variable, so if such a change is desirable it would require a slight rework of the current Hadoop conf files mounting logic, i.e. "spark.mesos.conf.dir" and "spark.kubernetes.conf.dir".

Author @adambalogh:

@jerryshao My understanding is that this method is not used by the InProcessLauncher. Instead, the caller of InProcessLauncher has to make sure that the conf files are available to Hadoop's Configuration class in the YarnClusterApplication, for example by adding the config files to the calling thread's context class loader.
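A minimal sketch of what such a caller might do (illustrative only, not part of this PR; the path is assumed). Hadoop's Configuration loads its *-site.xml files as classpath resources through the thread context class loader, so a directory placed on a URLClassLoader becomes visible to it:

import java.io.File
import java.net.{URL, URLClassLoader}

val confDir = new File("/etc/hadoop/conf")  // assumed conf dir path
// A directory URL on a URLClassLoader exposes the files inside it as
// classpath resources, where Configuration looks for core-site.xml etc.
val loader = new URLClassLoader(
  Array[URL](confDir.toURI.toURL),
  Thread.currentThread().getContextClassLoader)
Thread.currentThread().setContextClassLoader(loader)
// ...then start the application via InProcessLauncher from this thread.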

Contributor @vanzin:

Saisai's question about the classpath configuration is actually the most complicated part of this feature. I haven't fully thought about how they would play out, but I really don't think it's as simple as appending this new config to the classpath.

e.g. what is the expectation if you run "spark-shell" with this option? Do you end up using the config from the env variable or from the config? If you have both, and you reference a file in --files that is on an HDFS namespace declared in the hdfs-site.xml from the config, what will happen? (Answer: it will be ignored, since that is being masked by the hdfs-site.xml from the env variable.)

Author @adambalogh (Aug 31, 2018):

Yes, it is quite tricky. My expectation is that it would behave the same way as if you pointed HADOOP_CONF_DIR and YARN_CONF_DIR to different directories that both contain hdfs-site.xml: files in HADOOP_CONF_DIR would take precedence (as far as I know, nothing prevents this from happening). So with this new config, the order of precedence would be HADOOP_CONF_DIR, then YARN_CONF_DIR, then spark.yarn.conf.dir.

Perhaps I could clarify this in the docs. Let me know what you think; I'm happy to implement other resolutions.
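A self-contained sketch of that first-directory-wins behavior (directory paths are examples only): directories are scanned in precedence order, and a file name that was already recorded is skipped, producing exactly the masking discussed above:

import java.io.File
import scala.collection.mutable

val hadoopConfFiles = mutable.HashMap[String, File]()
// Example precedence order: HADOOP_CONF_DIR, YARN_CONF_DIR, spark.yarn.conf.dir
val dirsInPrecedenceOrder = Seq("/etc/hadoop/conf", "/etc/yarn/conf", "/opt/app/custom-conf")
for {
  path <- dirsInPrecedenceOrder
  files <- Option(new File(path).listFiles).toSeq
  file <- files
  if file.isFile && !hadoopConfFiles.contains(file.getName)
} hadoopConfFiles(file.getName) = file
// hdfs-site.xml from /etc/hadoop/conf masks any hdfs-site.xml in later dirs.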

Author @adambalogh:

@vanzin Did you have time to think about how this config should work?

addToClassPath(cp, getenv("SPARK_DIST_CLASSPATH"));
return new ArrayList<>(cp);
}
@@ -694,24 +694,21 @@ private[spark] class Client(
       }
     }

-    // SPARK-23630: during testing, Spark scripts filter out hadoop conf dirs so that user's
-    // environments do not interfere with tests. This allows a special env variable during
-    // tests so that custom conf dirs can be used by unit tests.
-    val confDirs = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR") ++
-      (if (Utils.isTesting) Seq("SPARK_TEST_HADOOP_CONF_DIR") else Nil)
-
-    confDirs.foreach { envKey =>
-      sys.env.get(envKey).foreach { path =>
-        val dir = new File(path)
-        if (dir.isDirectory()) {
-          val files = dir.listFiles()
-          if (files == null) {
-            logWarning("Failed to list files under directory " + dir)
-          } else {
-            files.foreach { file =>
-              if (file.isFile && !hadoopConfFiles.contains(file.getName())) {
-                hadoopConfFiles(file.getName()) = file
-              }
+    val confDirsEnvKeys = Seq("HADOOP_CONF_DIR", "YARN_CONF_DIR")
+    val configDirProp = sparkConf.getOption("spark.yarn.conf.dir")
+
+    val confDirPaths = (confDirsEnvKeys.map(sys.env.get) :+ configDirProp).flatMap(_.toList)
+    confDirPaths.foreach { path =>
+      logDebug("Reading config files from " + path)
+      val dir = new File(path)
+      if (dir.isDirectory()) {
+        val files = dir.listFiles()
+        if (files == null) {
+          logWarning("Failed to list files under directory " + dir)
+        } else {
+          files.foreach { file =>
+            if (file.isFile && !hadoopConfFiles.contains(file.getName())) {
+              hadoopConfFiles(file.getName()) = file
+            }
           }
         }
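As a worked example of the Option-combining line above (values assumed: HADOOP_CONF_DIR set, YARN_CONF_DIR unset, spark.yarn.conf.dir set), the flatMap keeps only the defined paths, with the env dirs first:

val envVals = Seq(Some("/etc/hadoop/conf"), None)  // HADOOP_CONF_DIR, YARN_CONF_DIR
val propVal = Some("/opt/app/custom-conf")         // spark.yarn.conf.dir
(envVals :+ propVal).flatMap(_.toList)
// => Seq("/etc/hadoop/conf", "/opt/app/custom-conf"): env dirs come first,
//    so their files win name collisions via the contains() check above.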
@@ -131,8 +131,9 @@ class YarnClusterSuite extends BaseYarnClusterSuite {
     val finalState = runSpark(false,
       mainClassName(YarnClusterDriverUseSparkHadoopUtilConf.getClass),
       appArgs = Seq("key=value", "spark.test.key=testvalue", result.getAbsolutePath()),
-      extraConf = Map("spark.hadoop.key" -> "value"),
-      extraEnv = Map("SPARK_TEST_HADOOP_CONF_DIR" -> customConf.getAbsolutePath()))
+      extraConf = Map(
+        "spark.hadoop.key" -> "value",
+        "spark.yarn.conf.dir" -> customConf.getAbsolutePath))
     checkResult(finalState, result)
   }
