Ship Hadoop configuration files to the driver and add to its classpath #130

Open
mccheah opened this issue Feb 21, 2017 · 8 comments

mccheah commented Feb 21, 2017

Currently we do not ship these files, so the only way Hadoop configuration options can be set is by setting spark.hadoop.* parameters on the Spark configuration.
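
For example, today every option has to be mirrored onto the Spark conf, roughly like this (a minimal sketch; values are placeholders):

```scala
import org.apache.spark.SparkConf

// Without shipped *-site.xml files, each Hadoop option must be duplicated
// on the Spark conf under the spark.hadoop. prefix. Values are placeholders.
val conf = new SparkConf()
  .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020")
  .set("spark.hadoop.dfs.replication", "2")
```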

ash211 commented Jun 9, 2017

@kimoonkim as you've been testing the HDFS locality changes recently, are you passing the *-site.xml config files into Spark in some way? Are you passing all the configuration as spark.hadoop.*, or is there no required config in the clusters you're testing?

@kimoonkim

@ash211 Good question. For the HDFS node-level locality tests so far, Spark only needed the namenode address. I passed it as spark.hadoop.fs.defaultFS.

But we plan to work on the rack locality part soon, which involves more. The Spark driver needs several config keys for the rack topology plugin, and it also needs access to a script or text file that the topology plugin refers to. (There are multiple topology plugin choices.) Those files are usually in the same Hadoop conf dir.

So it would be better to pass the *-site.xml and other files in the Hadoop conf dir. I haven't come up with a good approach yet, though. Do you have specific ideas?
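
For reference, the rack-awareness settings involved look roughly like this (both keys are standard Hadoop config; the script path is just an example):

```scala
import org.apache.spark.SparkConf

// ScriptBasedMapping is one of the standard topology plugins; it reads
// net.topology.script.file.name, which usually points into the Hadoop conf dir.
val conf = new SparkConf()
  .set("spark.hadoop.net.topology.node.switch.mapping.impl",
    "org.apache.hadoop.net.ScriptBasedMapping")
  .set("spark.hadoop.net.topology.script.file.name",
    "/etc/hadoop/conf/topology.sh") // example path
```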

mccheah commented Jun 9, 2017

We can build a ConfigMap instance, or allow the user to specify an existing one, which contains the core-site.xml that the job should use. Then, based on the path we mount the files to, we can set the HADOOP_CONF_DIR environment variable accordingly on our containers.

Depending on whether or not we expect core-site.xml to contain sensitive data, we might want to use a Secret instead.
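
A rough sketch of the ConfigMap construction, using the fabric8 client the submission client already depends on (the ConfigMap name and paths here are illustrative, not final):

```scala
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

import io.fabric8.kubernetes.api.model.ConfigMapBuilder

// Pack every file from the user's Hadoop conf dir into a ConfigMap,
// keyed by file name. "spark-hadoop-conf" is an illustrative name.
val confDir = Paths.get(sys.env("HADOOP_CONF_DIR"))
val data = Files.list(confDir).iterator().asScala
  .filter(p => Files.isRegularFile(p))
  .map(p => p.getFileName.toString -> new String(Files.readAllBytes(p)))
  .toMap

val hadoopConfigMap = new ConfigMapBuilder()
  .withNewMetadata().withName("spark-hadoop-conf").endMetadata()
  .addToData(data.asJava)
  .build()
// For sensitive files the same data could go into a Secret instead,
// with the values base64-encoded.
```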

ifilonenko self-assigned this Jul 11, 2017

ifilonenko commented Jul 11, 2017

In what cases would we see core-site.xml or hdfs-site.xml containing sensitive data that would need to be stored in a Secret? Any thoughts on why a ConfigMap wouldn't work, or why such .xml files can't simply be distributed via the resource staging server?

mccheah commented Jul 11, 2017

One case where the XML files might have sensitive data is when configuring Spark to communicate with S3; in that case, the XML files might contain AWS credentials.
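
For example, with the S3A filesystem the credentials are plain configuration keys, so a core-site.xml carrying them is sensitive (values are placeholders):

```scala
import org.apache.spark.SparkConf

// fs.s3a.access.key and fs.s3a.secret.key are standard S3A settings;
// if they appear in core-site.xml, that file must be treated as secret.
val conf = new SparkConf()
  .set("spark.hadoop.fs.s3a.access.key", "AKIA...")  // placeholder
  .set("spark.hadoop.fs.s3a.secret.key", "<secret>") // placeholder
```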

@ifilonenko

If the user isn't specifying an existing ConfigMap, how should they specify the file locations from which the submission client will create the ConfigMap?

mccheah commented Jul 11, 2017

The submission client can create the ConfigMap for the user and set HADOOP_CONF_DIR accordingly as well.
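
Something along these lines, sketched with the fabric8 builders (the mount path and names are illustrative):

```scala
import io.fabric8.kubernetes.api.model.{ContainerBuilder, VolumeBuilder}

// Expose the generated ConfigMap to the driver pod as a volume.
val hadoopConfVolume = new VolumeBuilder()
  .withName("hadoop-conf")
  .withNewConfigMap().withName("spark-hadoop-conf").endConfigMap()
  .build()

// Mount it in the driver container and point HADOOP_CONF_DIR at it.
val driverContainer = new ContainerBuilder()
  .withName("spark-kubernetes-driver")
  .addNewVolumeMount()
    .withName("hadoop-conf")
    .withMountPath("/etc/hadoop/conf")
  .endVolumeMount()
  .addNewEnv()
    .withName("HADOOP_CONF_DIR")
    .withValue("/etc/hadoop/conf")
  .endEnv()
  .build()
```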

@ifilonenko

#373 should handle this.

ifilonenko pushed a commit to ifilonenko/spark that referenced this issue Feb 25, 2019