
HDFS access umbrella issue #128

Open
ssuchter opened this issue Feb 17, 2017 · 8 comments

@ssuchter commented Feb 17, 2017

We believe that accessing existing (remote) HDFS systems will be a common mode of input and output data. We want to support and test this mode of usage.

Some issues to pay attention to:

  • HDFS version support. Since one is supposed to use client libraries that correspond to the HDFS server version, this affects both our testing and packaging.
  • NameNode address support. Since it's expected that many Spark apps will use the same HDFS cluster (this is common usage), can we provide a way to specify the NameNode in a default configuration, so that each app config doesn't have to repeat it? (Perhaps putting spark.hadoop.fs.defaultFS=hdfs://<host>:<port> in spark-defaults.conf is the right solution? See the sketch after this list.)
  • Identity support. How will users identify their username to HDFS? How will this be configurable? (For example, export HADOOP_USER_NAME=<username> in conf/spark-env.sh, also sketched below.)
  • Kerberos support. As an extension to basic identity support, Kerberos is a commonly used mechanism for authenticating the identified user. When will we say we do or don't support Kerberos against these remote clusters?
  • Non-support of HDFS locality. Since we generally expect these clusters to be remote from the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.

Other concerns?
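
For example, a cluster-default configuration along the lines of the first two bullets might look like the following (host, port, and username are placeholders; these exact files are an illustration, not settings posted in this thread):

# conf/spark-defaults.conf -- shared by all apps submitted from this install
spark.hadoop.fs.defaultFS  hdfs://<host>:<port>

# conf/spark-env.sh -- simple (non-Kerberos) identity for HDFS access
export HADOOP_USER_NAME=<username>
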
@mgilham commented Feb 21, 2017

Regarding Kerberos support, we're currently running Spark on k8s in standalone mode, and needed to connect to a secure HDFS cluster. After considering a few approaches, we implemented the following:

  • A pod containing two containers (a minimal sketch follows this list):
    • One container that runs preexisting spark-yarn code to authenticate to HDFS with a Secret-supplied Kerberos keytab, generate/refresh Hadoop delegation tokens on a recurring schedule, and write the tokens to a shared local Volume. This preexisting code was essentially independent of YARN.
    • Another container that polls for changes to the local Volume and updates another k8s secret containing the delegation tokens.
  • A minor patch to Spark so its CredentialUpdater (and HDFSCredentialProvider) runs in the Spark driver and worker processes without YARN. This code picks up the k8s secret containing the delegation tokens.
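
A minimal sketch of what such a pod might look like (all names, images, and mount paths here are hypothetical; the actual design was not posted in this thread):

# two-container token-refresher pod (illustrative only)
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-token-refresher            # hypothetical name
spec:
  volumes:
    - name: tokens
      emptyDir: {}                      # shared local volume for delegation tokens
    - name: keytab
      secret:
        secretName: hdfs-keytab         # hypothetical Secret holding the Kerberos keytab
  containers:
    - name: token-generator             # runs the preexisting spark-yarn renewal code
      image: example.com/token-generator
      volumeMounts:
        - name: keytab
          mountPath: /etc/hdfs-keytab
          readOnly: true
        - name: tokens
          mountPath: /var/run/hdfs-tokens
    - name: token-publisher             # polls the volume, updates the tokens Secret
      image: example.com/token-publisher
      volumeMounts:
        - name: tokens
          mountPath: /var/run/hdfs-tokens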

Would be happy to share design details or code if this sounds like a strategy that would be generally useful and appropriate.

@ash211 commented Feb 22, 2017

@mgilham thanks for sharing that approach to getting Kerberos authentication support into Spark on k8s in standalone mode -- it's helpful to know there's prior art.

@mccheah and I are unlikely to run into the Kerberized HDFS cluster problem for the next few months, so this would be a great place for someone with a more immediate need to lead the way. If you're able to share further design details or code for this approach, I think that would make it much easier for someone to scratch their own itch and add it.

@foxish added the in-beta label Feb 23, 2017
@kimoonkim commented Feb 28, 2017

  • NameNode address support.

Tested this successfully with spark.hadoop.fs.defaultFS hdfs://<host>:<port> in spark-defaults.conf. The Hadoop version was 2.6, without Kerberos.

I used something like the following:

$ /usr/local/spark-on-k8s/bin/spark-submit  \
    --class org.apache.spark.examples.DFSReadWriteTest  \
    --conf spark.app.name=spark-dfstest  \
    /usr/local/spark-on-k8s/examples/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar  \
    LOCAL-FILE-PATH HDFS-DIR-PATH

The log says:

Success! Local Word Count (1) and DFS Word Count (1) agree.

Note I did not specify the HDFS user name, which means the above ran as the root user.

@kimoonkim

  • Identity support. How will users identify their username to HDFS? How will this be configurable? (For example, export HADOOP_USER_NAME=<username> in conf/spark-env.sh)

I played with this a bit. I was hoping to find a way to pass the HADOOP_USER_NAME env var to the driver and executor pods, but I couldn't. It seems there is currently no easy way.

@kimoonkim

I was hoping to find a way to pass the HADOOP_USER_NAME env var to the driver and executor pods, but I couldn't.

FYI, this is the closest I got. I issued the following command, which uses spark.driver.extraJavaOptions to set HADOOP_USER_NAME as a JVM property (HADOOP_USER_NAME can be either a JVM property or an env var) and spark.executorEnv to set it as an env var on the executors.

$ /usr/local/spark-on-k8s/bin/spark-submit  \
    --class org.apache.spark.examples.DFSReadWriteTest  \
    --conf spark.kubernetes.driverSubmitTimeout=360  \
    --conf spark.app.name=spark-dfstest  \
    --conf spark.driver.extraJavaOptions="-DHADOOP_USER_NAME=kimoon"  \
    --conf spark.executorEnv.HADOOP_USER_NAME=kimoon  \
    /usr/local/spark-on-k8s/examples/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar  \
    /etc/hostname dfsreadwritetest-etc-hostname

The driver part worked and successfully created the HDFS dir under my user name, but the executors failed with permission errors while trying to write to the dir as root.

2017-02-28 19:37:29 INFO TaskSetManager:54 - Lost task 1.3 in stage 0.0 (TID 7) on spark-dfstest-1488310560364-exec-1, executor 1: org.apache.hadoop.security.AccessControlException (Permission denied: user=root, access=WRITE, inode="/user/kimoon/dfsreadwritetest-etc-hostname/dfs_read_write_test/_temporary/0":kimoon:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)

@lins05 commented Mar 9, 2017

Non-support of HDFS locality. Since we generally expect these clusters to be remote from the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.

I think locality is pretty important for Spark jobs on HDFS. We currently run Spark on Mesos, with some Mesos agents collocated with HDFS datanodes, and a good portion of our Spark jobs process data stored in HDFS. We use Mesos agent attributes to make sure Spark executors are launched only on those agents (e.g. something like --conf spark.mesos.constraints=group:hdfs).

We could do something similar here, e.g. add some code that lets the user specify a custom nodeSelector for the executor pods, e.g. --conf spark.kubernetes.executor.nodeSelectors=group:hdfs (sketched below).
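
To illustrate the idea (spark.kubernetes.executor.nodeSelectors is the name proposed above, not an existing option), the flag could render into a standard nodeSelector on each executor pod spec:

$ bin/spark-submit ...  \
    --conf spark.kubernetes.executor.nodeSelectors=group:hdfs ...

# resulting fragment of each executor pod spec:
spec:
  nodeSelector:
    group: hdfs    # schedule only onto nodes labeled group=hdfs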

@ash211 changed the title from "Support & test remote HDFS access" to "HDFS access umbrella issue" Mar 17, 2017
@ash211 commented Mar 17, 2017

Some preliminary work for basic HDFS support is happening in #130, which sends the contents of the HADOOP_CONF_DIR and YARN_CONF_DIR directories from the submitter to the driver for use when accessing HDFS.

We expect this to cover two use cases I've seen in testing so far:

  • allow reading from hdfs:// URIs that refer to an HDFS cluster by nameservice rather than by NameNode FQDN, which enables reading from an HA NameNode cluster
  • allow writing to nameservice-based URIs as well, e.g. for spark.eventLog.dir values on an HDFS cluster (see the illustrative config after this list)
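
For context, nameservice-based URIs resolve through the client-side HA entries in the Hadoop config shipped from HADOOP_CONF_DIR. A minimal sketch of such entries (the mycluster nameservice and hostnames are placeholders, not values from this thread):

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The submitter would typically just point HADOOP_CONF_DIR at a directory containing these files before running spark-submit:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ bin/spark-submit ...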

Obviously there's a lot more work to be done, but with these basics in place hopefully the group will be unblocked for richer interactions with HDFS clusters, both in-kube and outside.

@kimoonkim

Kerberos support is being addressed in #379 and #391.
