HDFS access umbrella issue #128
Regarding Kerberos support, we're currently running Spark on k8s in standalone mode, and needed to connect to a secure HDFS cluster. After considering a few approaches, we implemented the following:
Would be happy to share design details or code if this sounds like a strategy that would be generally useful and appropriate.
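In case it helps whoever picks this up next: one common shape for this kind of setup (not necessarily the approach described above; every name below is hypothetical) is to store the keytab in a Kubernetes Secret, mount it into the Spark containers, and run kinit before the Spark process starts.

```yaml
# Hypothetical sketch only -- not the design referred to above.
# Assumes a Secret created with:
#   kubectl create secret generic spark-keytab --from-file=spark.keytab
# and a Spark image that has a Kerberos client installed.
apiVersion: v1
kind: Pod
metadata:
  name: spark-standalone-worker
spec:
  containers:
  - name: worker
    image: <spark-image-with-krb5-client>
    command: ["/bin/sh", "-c"]
    args:
    - >-
      kinit -kt /etc/krb5-keytab/spark.keytab <principal>@<REALM> &&
      exec /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://<master-host>:7077
    volumeMounts:
    - name: keytab
      mountPath: /etc/krb5-keytab
      readOnly: true
  volumes:
  - name: keytab
    secret:
      secretName: spark-keytab
```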
@mgilham thanks for sharing that approach to getting Kerberos auth support in Spark on k8s in standalone mode -- it's helpful to know that there's prior art. @mccheah and I are unlikely to run into the Kerberized HDFS cluster problem for the next few months, so this would be a great place for someone with a more immediate need to lead the way. If you're able to share further design details or code related to this approach, I think that would make it much easier for someone to scratch their own itch and add this in.
Tested this successfully. I used something like the following:
The log says:
Note I did not specify the HDFS user name, which means the above ran as the default user baked into the Docker image.
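For anyone trying to reproduce this, an invocation along these lines (not necessarily the exact one used above) points a job at an external namenode via the `spark.hadoop.fs.defaultFS` passthrough; the API server address, image names, jar path, and namenode address are placeholders, and DFSReadWriteTest is just one convenient HDFS-touching example job:

```sh
# Representative only -- not the exact command or build used above.
# <api-server>, <namenode>, image names, and paths are placeholders.
bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<api-server>:443 \
  --class org.apache.spark.examples.DFSReadWriteTest \
  --conf spark.kubernetes.namespace=default \
  --conf spark.kubernetes.driver.docker.image=<driver-image> \
  --conf spark.kubernetes.executor.docker.image=<executor-image> \
  --conf spark.hadoop.fs.defaultFS=hdfs://<namenode>:8020 \
  local:///opt/spark/examples/jars/<spark-examples-jar> \
  /opt/spark/README.md /user/<dir-to-write>
```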
I played with this a bit. I was hoping to find a way to pass the HADOOP_USER_NAME env var to the driver and executor pods, but I couldn't. It seems there is no easy way currently?
FYI, this is the closest I got. I issued the following command, which sets HADOOP_USER_NAME:
The driver part worked and successfully created the HDFS dir under my user name. But the executors failed with permission errors while trying to write to the dir as root.
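For the executor side specifically, Spark's generic `spark.executorEnv.[Name]` setting is the documented way to inject an environment variable into executor processes. Whether the k8s scheduler backend propagated it to executor pods at this point is something I haven't verified, so treat the following as a sketch with placeholder values:

```sh
# Sketch only: relies on spark.executorEnv.* being honored by the k8s backend,
# which is unverified here. <username>, <api-server>, <namenode> are placeholders.
bin/spark-submit \
  --master k8s://https://<api-server>:443 \
  --conf spark.executorEnv.HADOOP_USER_NAME=<username> \
  --conf spark.hadoop.fs.defaultFS=hdfs://<namenode>:8020 \
  <other args as above>
```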
I think locality is pretty important for Spark jobs on HDFS. Currently we run Spark on Mesos, with some Mesos agents collocated with HDFS datanodes, and a good part of our Spark jobs process data in HDFS. We used Mesos's agent attributes to make sure Spark executors are only launched on those agents. We can do similar things here, e.g. add some code to let the user specify a custom nodeSelector for the executor pods.
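The Kubernetes-side primitive for that suggestion is the pod-level nodeSelector field. A hypothetical executor pod fragment, assuming the nodes that also host datanodes have been labeled (the label key/value here is made up):

```yaml
# Hypothetical: label the datanode-colocated nodes first, e.g.
#   kubectl label node <node-name> hdfs-datanode=true
# then have the submission code stamp executor pods with a matching nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: <executor-pod-name>
spec:
  nodeSelector:
    hdfs-datanode: "true"    # only schedule onto nodes colocated with HDFS datanodes
  containers:
  - name: executor
    image: <spark-executor-image>
```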
Some preliminary work for basic HDFS support is happening in #130, to send the contents of the client's Hadoop configuration to the driver and executors. We expect this to cover two use cases I've seen in testing so far:
Obviously a lot more work to be done, but with these basics in place hopefully the group will be unblocked for richer interactions with HDFS clusters, both in-kube and outside.
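For anyone unfamiliar with what "Hadoop configuration" means concretely here: for plain (non-Kerberized) HDFS access the essential piece is usually a core-site.xml naming the namenode. One way to get it into pods (illustrative only, not necessarily what #130 implements) is a ConfigMap mounted at a path that HADOOP_CONF_DIR points to:

```sh
# Illustrative only -- not necessarily the mechanism #130 uses.
# Package the client's Hadoop config so pods can mount it at $HADOOP_CONF_DIR.
kubectl create configmap hadoop-conf \
  --from-file=core-site.xml=/etc/hadoop/conf/core-site.xml \
  --from-file=hdfs-site.xml=/etc/hadoop/conf/hdfs-site.xml
```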
We believe that accessing existing (remote) HDFS systems will be a common way to read input and write output data. We want to support and test this mode of usage.
Some issues to pay attention to:

- How to point jobs at the target HDFS cluster (is putting `spark.hadoop.fs.defaultFS=hdfs://<host>:<port>` in spark-defaults.conf the right solution?)
- How to specify the HDFS user name to act as (e.g. `export HADOOP_USER_NAME=<username>` in conf/spark-env.sh)

Other concerns?
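For concreteness, the two settings above written out as the actual client-side files (host, port, and username are placeholders):

```sh
# conf/spark-defaults.conf -- point the Hadoop client at the target namenode
spark.hadoop.fs.defaultFS   hdfs://<host>:<port>

# conf/spark-env.sh -- the HDFS user name that jobs should act as
export HADOOP_USER_NAME=<username>
```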