
[SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as spark property #22289

Closed
wants to merge 11 commits

Conversation


@adambalogh adambalogh commented Aug 30, 2018

What changes were proposed in this pull request?

When submitting applications to YARN in cluster mode using the InProcessLauncher, Spark finds the cluster's configuration files based on the HADOOP_CONF_DIR/YARN_CONF_DIR environment variables. This makes it impossible to submit to more than one YARN cluster concurrently using the InProcessLauncher.

This PR adds a new property, spark.hadoop.conf.dir, that lets users specify the location of the configuration files for each submission separately.
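For illustration, the intended usage from the launcher API might look like the following sketch. This is an assumption, not code from this PR: the property name is taken from the description above, and the paths and class names are made up.

```java
import org.apache.spark.launcher.InProcessLauncher;
import org.apache.spark.launcher.SparkAppHandle;

// Hypothetical usage: two concurrent in-process submissions, each pointing
// at its own Hadoop configuration directory instead of a shared
// HADOOP_CONF_DIR environment variable.
public class TwoClusterSubmit {

    static SparkAppHandle submit(String hadoopConfDir) throws Exception {
        return new InProcessLauncher()
                .setMaster("yarn")
                .setDeployMode("cluster")
                .setAppResource("/path/to/app.jar")      // made-up path
                .setMainClass("com.example.Main")        // made-up class
                .setConf("spark.hadoop.conf.dir", hadoopConfDir)
                .startApplication();
    }

    public static void main(String[] args) throws Exception {
        // Each submission would read its cluster's core-site.xml /
        // yarn-site.xml from a different directory.
        SparkAppHandle a = submit("/etc/clusterA/conf"); // made-up paths
        SparkAppHandle b = submit("/etc/clusterB/conf");
    }
}
```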

How was this patch tested?

Integration test

cc @vanzin @jerryshao

@AmplabJenkins

Can one of the admins verify this patch?

@@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {

addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
addToClassPath(cp, getenv("YARN_CONF_DIR"));
addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
Contributor

I'm wondering how we would update the classpath to switch to another set of Hadoop confs with the InProcessLauncher. It seems the classpath here cannot be changed after the JVM is launched.

Contributor

@ifilonenko ifilonenko Aug 31, 2018

On another note, is this meant to extend to other resource managers? Kubernetes assumes only the HADOOP_CONF_DIR environment variable, so if such a change is desirable it would require a slight rework of the current Hadoop conf file mounting logic,
i.e. "spark.mesos.conf.dir" and "spark.kubernetes.conf.dir"

Author

@jerryshao My understanding is that this method is not used by the InProcessLauncher. Instead, the caller of InProcessLauncher has to make sure that the conf files are available to Hadoop's Configuration class in the YarnClusterApplication, for example by adding the config files to the calling thread's context class loader.
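A minimal sketch of that workaround, assuming the context-class-loader approach described above (this is an illustration, not code from this PR, and the path in main is made up):

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch: expose a per-submission Hadoop conf directory to Hadoop's
// Configuration class by putting it on the calling thread's context class
// loader before invoking the InProcessLauncher.
public class LaunchWithConfDir {

    static void withConfDir(String confDir, Runnable submit) throws Exception {
        ClassLoader original = Thread.currentThread().getContextClassLoader();
        // A directory URL (File.toURI() appends the trailing "/" for existing
        // directories) lets the class loader serve core-site.xml,
        // hdfs-site.xml and yarn-site.xml from confDir as classpath resources.
        URL[] urls = { new File(confDir).toURI().toURL() };
        try (URLClassLoader withConf = new URLClassLoader(urls, original)) {
            Thread.currentThread().setContextClassLoader(withConf);
            submit.run(); // e.g. new InProcessLauncher()...startApplication()
        } finally {
            Thread.currentThread().setContextClassLoader(original);
        }
    }

    public static void main(String[] args) throws Exception {
        withConfDir("/etc/clusterA/conf", // made-up path
                () -> System.out.println("would submit with cluster A's conf"));
    }
}
```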

Contributor

Saisai's question about the classpath configuration is actually the most complicated part of this feature. I haven't fully thought about how they would play out, but I really don't think it's as simple as appending this new config to the classpath.

e.g. what is the expectation if you run "spark-shell" with this option? Do you end up using the config from the env variable or from the config? If you have both, and you reference a file in --files that is on an HDFS namespace declared in the hdfs-site.xml from the config, what will happen? (Answer: it will be ignored, since that is being masked by the hdfs-site.xml from the env variable.)

Author

@adambalogh adambalogh Aug 31, 2018

Yes, it is quite tricky. My expectation is that it would behave the same way as if you pointed HADOOP_CONF_DIR and YARN_CONF_DIR to different directories that both contain hdfs-site.xml. Files in HADOOP_CONF_DIR would take precedence (as far as I know, nothing prevents this from happening). So with this new config, the order of preference would be HADOOP_CONF_DIR, YARN_CONF_DIR, then spark.yarn.conf.dir.

Perhaps I could clarify this in the docs, but let me know what you think about it, I'm happy to implement other resolutions.
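The masking behavior discussed here, where the first matching file on the classpath hides later copies, can be demonstrated with plain Java. This is a standalone illustration, not code from this PR; the directory and file contents are made up.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ClasspathMasking {
    public static void main(String[] args) throws Exception {
        // Two stand-in config directories, each holding an hdfs-site.xml.
        Path dirA = Files.createTempDirectory("hadoop-conf");
        Path dirB = Files.createTempDirectory("yarn-conf");
        Files.write(dirA.resolve("hdfs-site.xml"), "<!-- from dirA -->".getBytes());
        Files.write(dirB.resolve("hdfs-site.xml"), "<!-- from dirB -->".getBytes());

        // Classpath order: dirA first, dirB second, mirroring the proposed
        // precedence HADOOP_CONF_DIR, YARN_CONF_DIR, then spark.yarn.conf.dir.
        URLClassLoader loader = new URLClassLoader(
                new URL[] { dirA.toUri().toURL(), dirB.toUri().toURL() }, null);

        // getResource returns the FIRST match, so dirB's copy is masked.
        URL found = loader.getResource("hdfs-site.xml");
        String content = new String(Files.readAllBytes(Paths.get(found.toURI())));
        System.out.println(content); // prints "<!-- from dirA -->"
    }
}
```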

Author

@vanzin Did you have time to think about how this config should work?

@adambalogh
Author

adambalogh commented Aug 31, 2018

@ifilonenko I don't think we need to extend this to other resource managers, since users of the Kubernetes/Mesos RMs probably don't want to use multiple Hadoop clusters at the same time that often. Or did I misunderstand your question?

EDIT: I see your point; maybe we should just call it spark.hadoop.conf.dir?

@vanzin
Contributor

vanzin commented Sep 7, 2018

spark.hadoop.* is not a good name. That's a special prefix in Spark that modifies any Hadoop Configuration object that Spark instantiates. That's the easy one.

The hard one is that your change doesn't seem to achieve what your PR description says. What you're doing is just uploading the contents of spark.hadoop.conf.dir instead of HADOOP_CONF_DIR with your YARN app. That means a bunch of things:

  • the Client class is still using whatever Hadoop configuration is in the classpath to choose the YARN service that will actually run the app.
  • the uploaded config is actually added at the end of the classpath of the AM / executors; the RM places its own configuration before that in the classpath, so in the launched processes, you're still not going to be using the configuration you defined in spark.hadoop.conf.dir.
  • the configuration used by the Client class that I mention above is actually written to a separate file, also sent over to the AM / executors, and overlaid on top of the configuration (see SparkHadoopUtil.newConfiguration).

So to actually achieve what you want to do, you'd have to fix at least two things:

  • SparkHadoopUtil.newConfiguration
  • the way Client creates the YARN configuration (which is here)

Otherwise, this change isn't actually doing much that I can see.
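As background on the naming objection above: Spark treats spark.hadoop.* as a reserved prefix and copies those entries, with the prefix stripped, into every Hadoop Configuration it creates. The following is a minimal stand-in for that mechanism, using plain maps instead of real SparkConf/Configuration objects, which is an assumption for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// Why "spark.hadoop.conf.dir" is a bad name: under the reserved prefix,
// Spark would strip "spark.hadoop." and set the Hadoop property "conf.dir"
// on every Configuration, rather than treating it as a new Spark setting.
public class HadoopPrefixDemo {

    static Map<String, String> appliedToHadoopConf(Map<String, String> sparkConf) {
        Map<String, String> hadoopConf = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith("spark.hadoop.")) {
                // Strip the prefix; the remainder becomes a Hadoop property.
                hadoopConf.put(e.getKey().substring("spark.hadoop.".length()),
                        e.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        Map<String, String> spark = new HashMap<>();
        spark.put("spark.hadoop.conf.dir", "/etc/hadoop/conf");
        spark.put("spark.executor.memory", "2g");
        // Only the prefixed entry lands in the Hadoop configuration.
        System.out.println(appliedToHadoopConf(spark)); // prints {conf.dir=/etc/hadoop/conf}
    }
}
```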

@adambalogh
Author

Thank you for the detailed explanation! @vanzin

I agree with what you are saying, however I'm not sure about some of your points about configs, so I would like to find common ground on how Hadoop/YARN configuration is supposed to work.

Regarding your 3 points about how configs work, I agree with point 1. For point 2, however, I could not find documentation about the RM adding its own Hadoop config files to the AM/executors' classpath. Is that documented somewhere, or is it configurable? I did some experimenting: I placed invalid configuration in HADOOP_CONF_DIR's hdfs-site.xml (but not in the YARN Client's configs on the classpath), and the AM failed to start up. That indicates it is actually using the configs from LOCALIZED_HADOOP_CONF_DIR, which is based on the contents of HADOOP_CONF_DIR, and not the RM's hdfs-site.xml, which had the correct configuration.

For point 3, the YARN Client does distribute its own Hadoop configs to SPARK_HADOOP_CONF_FILE, which should be overlaid on top of the AM/executors' configs as you said. However, it seems the ApplicationMaster is actually not doing that, because it doesn't use the newConfiguration instance method from SparkHadoopUtil; instead it uses the static newConfiguration method, which doesn't do the overlaying. Is that intentional? It seems it was introduced here

Sorry for the long comment, and please let me know if I got something wrong.

@vanzin
Contributor

vanzin commented Sep 11, 2018

I failed to find documentation about the RM adding its own Hadoop config files to the AM/executors' classpath

See Client.getYarnAppClasspath and Client.getMRAppClasspath.

however it seems like ApplicationMaster is actually not doing that, because it doesn’t use the newConfiguration instance method

That may have been intentional. The AM-specific code (which is in the ApplicationMaster class) should use the configuration provided by the YARN cluster, since its purpose is to talk to YARN and that's it. User-specific changes can actually break things (e.g. point the AM at a different HDFS than the one where files were staged).

The driver code, which also runs inside the AM process, overlays the user config in the Configuration used by the SparkContext.

@vanzin
Contributor

vanzin commented Dec 10, 2018

@adambalogh if you're not planning to address the issues in this PR we should probably close it.

@adambalogh
Author

Sorry for the lack of response. We found that adding the right files to the in-process spark-submit's classpath mostly does the job, so I think I'll leave this as it is now.

@adambalogh adambalogh closed this Dec 11, 2018
@ibzx

ibzx commented Sep 3, 2020

Sorry for the lack of response. We found that adding the right files to the in-process spark-submit's classpath mostly does the job, so I think I'll leave this as it is now.

@adambalogh How did you add these files to the classpath?

rshkv added a commit to palantir/spark that referenced this pull request Feb 23, 2021
We use the InProcessLauncher internally [1] to launch to different YARN
clusters. The clusters might need different configuration files, which
we can't keep apart if the InProcessLauncher discovers config folders
from the same HADOOP_CONF_DIR environment variable.

This change allows us to specify different config directories using
Spark config.

See upstream PR [2] and ticket [3].

[1] https://pl.ntr/1UK
[2] apache#22289
[3] https://issues.apache.org/jira/browse/SPARK-25200

Co-authored-by: Adam Balogh <[email protected]>
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
rshkv added a commit to palantir/spark that referenced this pull request Feb 25, 2021
rshkv added a commit to palantir/spark that referenced this pull request Feb 26, 2021
rshkv added a commit to palantir/spark that referenced this pull request Feb 26, 2021
jdcasale added a commit to palantir/spark that referenced this pull request Mar 3, 2021
rshkv added a commit to palantir/spark that referenced this pull request Mar 4, 2021