[SPARK-25200][YARN] Allow specifying HADOOP_CONF_DIR as spark property #22289
Conversation
Can one of the admins verify this patch?
```diff
@@ -200,6 +200,7 @@ void addOptionString(List<String> cmd, String options) {
     addToClassPath(cp, getenv("HADOOP_CONF_DIR"));
     addToClassPath(cp, getenv("YARN_CONF_DIR"));
+    addToClassPath(cp, getEffectiveConfig().get("spark.yarn.conf.dir"));
```
I'm wondering how we would update the classpath to switch to another set of Hadoop confs with the InProcessLauncher? It seems the classpath here is not changeable after the JVM is launched.
On another note, is this meant to extend to other resource managers, i.e. "spark.mesos.conf.dir" and "spark.kubernetes.conf.dir"? Kubernetes currently assumes only the HADOOP_CONF_DIR environment variable, so if such a change is desirable it would require a slight rework of the current Hadoop conf file mounting logic.
@jerryshao My understanding is that this method is not used by the InProcessLauncher. Instead, the caller of the InProcessLauncher has to make sure that the conf files are available to Hadoop's Configuration class in the YarnClusterApplication, for example by adding the config files to the calling thread's context class loader.
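A minimal sketch of that context-class-loader approach (the directory path and class name below are hypothetical illustrations, not part of this PR): before submitting through the InProcessLauncher, wrap the current context class loader in a URLClassLoader that includes the cluster-specific conf directory, so code that looks up config files as classpath resources can find them.

```java
// Hypothetical sketch: expose a cluster-specific conf directory to code that
// loads Hadoop config files via the thread context class loader.
// The path "/etc/cluster-a/hadoop-conf/" is an assumption for illustration.
import java.net.URI;
import java.net.URL;
import java.net.URLClassLoader;

public class ConfClassLoaderSketch {
    public static void main(String[] args) throws Exception {
        // A directory URL on a class loader must end with "/" so it is
        // searched as a directory rather than treated as a jar file.
        URL confDir = URI.create("file:///etc/cluster-a/hadoop-conf/").toURL();
        ClassLoader current = Thread.currentThread().getContextClassLoader();
        URLClassLoader withConf = new URLClassLoader(new URL[] { confDir }, current);
        Thread.currentThread().setContextClassLoader(withConf);
        // Code run on this thread (e.g. an in-process submission) would now
        // resolve hdfs-site.xml etc. from confDir via getResource().
        System.out.println(withConf.getURLs()[0]);
    }
}
```

Since the URLClassLoader delegates to its parent first, entries already on the JVM classpath still take precedence over the added directory.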
Saisai's question about the classpath configuration is actually the most complicated part of this feature. I haven't fully thought about how they would play out, but I really don't think it's as simple as appending this new config to the classpath.
e.g. what is the expectation if you run "spark-shell" with this option? Do you end up using the config from the env variable or from the config? If you have both, and you reference a file in --files that is on an HDFS namespace declared in the hdfs-site.xml from the config, what will happen? (Answer: it will be ignored, since that is being masked by the hdfs-site.xml from the env variable.)
Yes, it is quite tricky. My expectation is that it would behave the same way as if you pointed HADOOP_CONF_DIR and YARN_CONF_DIR to different directories that both contain hdfs-site.xml. Files in HADOOP_CONF_DIR would take precedence (as far as I know, nothing prevents this from happening). So with this new config, the order of precedence would be HADOOP_CONF_DIR, then YARN_CONF_DIR, then spark.yarn.conf.dir.
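The masking behavior being discussed follows from how the JVM resolves resources on a classpath: the first matching entry wins. A small self-contained demonstration (plain Java, not Spark code; the temp directories stand in for HADOOP_CONF_DIR and the proposed spark.yarn.conf.dir):

```java
// Demonstration of classpath resource masking: when two directories on the
// same class loader both contain hdfs-site.xml, the directory listed first
// masks the copy in the second. This mirrors why HADOOP_CONF_DIR would win
// over spark.yarn.conf.dir if both are appended in that order.
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

public class ClasspathOrder {
    public static void main(String[] args) throws IOException {
        Path hadoopConf = Files.createTempDirectory("hadoop-conf");
        Path sparkConf = Files.createTempDirectory("spark-yarn-conf");
        Files.writeString(hadoopConf.resolve("hdfs-site.xml"), "<configuration/>");
        Files.writeString(sparkConf.resolve("hdfs-site.xml"), "<configuration/>");

        URLClassLoader cl = new URLClassLoader(new URL[] {
            hadoopConf.toUri().toURL(),  // appended first, like HADOOP_CONF_DIR
            sparkConf.toUri().toURL()    // appended last, like spark.yarn.conf.dir
        }, null);

        URL found = cl.getResource("hdfs-site.xml");
        // True: the copy from the first directory is returned; the second is masked.
        System.out.println(found.toString().startsWith(hadoopConf.toUri().toString()));
    }
}
```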
Perhaps I could clarify this in the docs, but let me know what you think about it; I'm happy to implement other approaches.
@vanzin Did you have time to think about how this config should work?
@ifilonenko I don't think we need to extend this to other resource managers, since users of the Kubernetes/Mesos RMs probably don't want to use multiple Hadoop clusters at the same time that often. Or did I misunderstand your question? EDIT: I see your point, maybe we should just call it
The hard one is that your change doesn't seem to achieve what your PR description says. What you're doing is just uploading the contents of
So to actually achieve what you want to do, you'd have to fix at least two things:
Otherwise, this change isn't actually doing much that I can see.
Thank you for the detailed explanation! @vanzin I agree with what you are saying; however, I'm not sure about some of your points about configs, so I would like to find common ground regarding how Hadoop/YARN configuration is supposed to work.

Regarding your 3 points about how configs work, I agree with point 1. However, for point 2, I failed to find documentation about the RM adding its own Hadoop config files to the AM/executors' classpath. Is that documented somewhere, or is that configurable? I did some experimenting where I placed some invalid configurations in

For point 3, the yarn

Sorry for the long comment, and please let me know if I got something wrong.
See
That may have been intentional. The AM-specific code (which is in the

The driver code, which also runs inside the AM process, overlays the user config in the
This reverts commit f8e828c.
@adambalogh if you're not planning to address the issues in this PR, we should probably close it.
Sorry for the lack of response. We found that adding the right files to the in-process spark-submit's classpath mostly does the job, so I think I'll leave this as it is now.
@adambalogh How did you address adding these files to the classpath?
We use the InProcessLauncher internally [1] to launch to different YARN clusters. The clusters might need different configuration files, which we can't keep apart if the InProcessLauncher discovers config folders from the same HADOOP_CONF_DIR environment variable. This change allows us to specify different config directories using Spark config. See upstream PR [2] and ticket [3].

[1] https://pl.ntr/1UK
[2] apache#22289
[3] https://issues.apache.org/jira/browse/SPARK-25200

Co-authored-by: Adam Balogh <[email protected]>
Co-authored-by: Robert Kruszewski <[email protected]>
Co-authored-by: Josh Casale <[email protected]>
Co-authored-by: Will Raschkowski <[email protected]>
What changes were proposed in this pull request?
When submitting applications to YARN in cluster mode using the InProcessLauncher, Spark finds the cluster's configuration files based on the HADOOP_CONF_DIR/YARN_CONF_DIR environment variables. This makes it impossible to submit to more than one YARN cluster concurrently using the InProcessLauncher.
This PR adds a new property spark.hadoop.conf.dir that lets users select the location of the config files for each submission separately.

How was this patch tested?
Integration test
cc @vanzin @jerryshao
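The lookup order implied by the diff (environment variables first, then the new property) can be sketched in isolation as plain Java. The confDirs helper below is a stand-in for illustration, not Spark's actual AbstractCommandBuilder, and the key spark.yarn.conf.dir is taken from the diff in this PR:

```java
// Hypothetical stand-in for the classpath-building order in the diff:
// env vars are appended before the new Spark property, so files found
// under HADOOP_CONF_DIR mask duplicates from spark.yarn.conf.dir.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ConfDirResolution {
    static List<String> confDirs(Map<String, String> env, Map<String, String> sparkConf) {
        List<String> cp = new ArrayList<>();
        // Same append order as the diff; null (unset) entries are skipped.
        for (String dir : new String[] {
                env.get("HADOOP_CONF_DIR"),
                env.get("YARN_CONF_DIR"),
                sparkConf.get("spark.yarn.conf.dir") }) {
            if (dir != null) cp.add(dir);
        }
        return cp;
    }

    public static void main(String[] args) {
        // Example paths are assumptions for illustration.
        Map<String, String> env = Map.of("HADOOP_CONF_DIR", "/etc/hadoop/conf");
        Map<String, String> conf = Map.of("spark.yarn.conf.dir", "/etc/cluster-b/conf");
        System.out.println(confDirs(env, conf));
    }
}
```

With both set, the environment-derived directory comes first on the classpath, which is the precedence order discussed in the review thread.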