
HDFS access umbrella issue #128

Open
ssuchter opened this issue Feb 17, 2017 · 8 comments

@ssuchter commented Feb 17, 2017

We believe that accessing existing (remote) HDFS systems will be a common mode of input and output data. We want to support and test this mode of usage.

Some issues to pay attention to:

  • HDFS version support. Since one is supposed to use client libraries that correspond to the HDFS server version, this affects both our testing and packaging.
  • NameNode address support. Since it's expected that many Spark apps will use the same HDFS cluster (this is common usage), can we provide a way to specify the NameNode in a default configuration, so that each app config doesn't have to repeat it? (Perhaps putting spark.hadoop.fs.defaultFS=hdfs://<host>:<port> in spark-defaults.conf is the right solution? See the sketch after this list.)
  • Identity support. How will users identify their username to HDFS? How will this be configurable? (For example, export HADOOP_USER_NAME=<username> in conf/spark-env.sh, also sketched below.)
  • Kerberos support. As an extension to basic identity support, Kerberos is a commonly used mechanism for authenticating the identified user. When will we say we do or don't support Kerberos against these remote clusters?
  • Non-support of HDFS locality. Since we generally expect these clusters to be remote from the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.

Other concerns?
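
For example, a cluster-default configuration along the lines of the first two bullets might look like the following (host, port, and username are placeholders; these exact files are an illustration, not settings posted in this thread):

# conf/spark-defaults.conf -- shared by all apps submitted from this install
spark.hadoop.fs.defaultFS  hdfs://<host>:<port>

# conf/spark-env.sh -- simple (non-Kerberos) identity for HDFS access
export HADOOP_USER_NAME=<username>
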
@mgilham commented Feb 21, 2017

Regarding Kerberos support, we're currently running Spark on k8s in standalone mode, and needed to connect to a secure HDFS cluster. After considering a few approaches, we implemented the following:

  • A pod containing two containers (a minimal sketch follows this list):
    • One container that runs preexisting spark-yarn code to authenticate to HDFS with a Secret-supplied Kerberos keytab, generate/refresh Hadoop delegation tokens on a recurring schedule, and write the tokens to a shared local Volume. This preexisting code was essentially independent of YARN.
    • Another container that polls for changes to the local Volume and updates another k8s secret containing the delegation tokens.
  • A minor patch to Spark so its CredentialUpdater (and HDFSCredentialProvider) runs in the Spark driver and worker processes without YARN. This code picks up the k8s secret containing the delegation tokens.
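
A minimal sketch of what such a pod might look like (all names, images, and mount paths here are hypothetical; the actual design was not posted in this thread):

# two-container token-refresher pod (illustrative only)
apiVersion: v1
kind: Pod
metadata:
  name: hdfs-token-refresher            # hypothetical name
spec:
  volumes:
    - name: tokens
      emptyDir: {}                      # shared local volume for delegation tokens
    - name: keytab
      secret:
        secretName: hdfs-keytab         # hypothetical Secret holding the Kerberos keytab
  containers:
    - name: token-generator             # runs the preexisting spark-yarn renewal code
      image: example.com/token-generator
      volumeMounts:
        - name: keytab
          mountPath: /etc/hdfs-keytab
          readOnly: true
        - name: tokens
          mountPath: /var/run/hdfs-tokens
    - name: token-publisher             # polls the volume, updates the tokens Secret
      image: example.com/token-publisher
      volumeMounts:
        - name: tokens
          mountPath: /var/run/hdfs-tokens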

Would be happy to share design details or code if this sounds like a strategy that would be generally useful and appropriate.

@ash211 commented Feb 22, 2017

@mgilham thanks for sharing that approach to getting Kerberos authentication support into Spark on k8s in standalone mode -- it's helpful to know there's prior art.

@mccheah and I are unlikely to run into the Kerberized HDFS cluster problem for the next few months, so this would be a great place for someone with a more immediate need to lead the way. If you're able to share further design details or code for this approach, I think that would make it much easier for someone to scratch their own itch and add it.

@foxish added the in-beta label Feb 23, 2017
@kimoonkim commented Feb 28, 2017

  • NameNode address support.

Tested this successfully with spark.hadoop.fs.defaultFS hdfs://<host>:<port> in spark-defaults.conf. The Hadoop version was 2.6, without Kerberos.

I used something like the following:

$ /usr/local/spark-on-k8s/bin/spark-submit  \
    --class org.apache.spark.examples.DFSReadWriteTest  \
    --conf spark.app.name=spark-dfstest  \
    /usr/local/spark-on-k8s/examples/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar  \
    LOCAL-FILE-PATH HDFS-DIR-PATH

The log says:

Success! Local Word Count (1) and DFS Word Count (1) agree.

Note I did not specify the HDFS user name, which means the above ran as the root user.

@kimoonkim

  • Identity support. How will users identify their username to HDFS? How will this be configurable? (For example, export HADOOP_USER_NAME=<username> in conf/spark-env.sh)

I played with this a bit. I was hoping to find a way to pass the HADOOP_USER_NAME env var to the driver and executor pods, but I couldn't. It seems there is currently no easy way.

@kimoonkim

I was hoping to find a way to pass the HADOOP_USER_NAME env var to the driver and executor pods, but I couldn't.

FYI, this is the closest I got. I issued the following command, which uses spark.driver.extraJavaOptions to set HADOOP_USER_NAME as a JVM property (HADOOP_USER_NAME can be either a JVM property or an env var) and spark.executorEnv to set it as an env var on the executors.

$ /usr/local/spark-on-k8s/bin/spark-submit  \
    --class org.apache.spark.examples.DFSReadWriteTest  \
    --conf spark.kubernetes.driverSubmitTimeout=360  \
    --conf spark.app.name=spark-dfstest  \
    --conf spark.driver.extraJavaOptions="-DHADOOP_USER_NAME=kimoon"  \
    --conf spark.executorEnv.HADOOP_USER_NAME=kimoon  \
    /usr/local/spark-on-k8s/examples/jars/spark-examples_2.11-2.2.0-SNAPSHOT.jar  \
    /etc/hostname dfsreadwritetest-etc-hostname

The driver part worked and successfully created the HDFS dir under my user name, but the executors failed with permission errors while trying to write to the dir as root.

2017-02-28 19:37:29 INFO TaskSetManager:54 - Lost task 1.3 in stage 0.0 (TID 7) on spark-dfstest-1488310560364-exec-1, executor 1: org.apache.hadoop.security.AccessControlException (Permission denied: user=root, access=WRITE, inode="/user/kimoon/dfsreadwritetest-etc-hostname/dfs_read_write_test/_temporary/0":kimoon:supergroup:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:257)

@lins05 commented Mar 9, 2017

Non-support of HDFS locality. Since we generally expect these clusters to be remote from the Kubernetes cluster, we don't expect to implement any kind of HDFS locality optimization.

I think locality is pretty important for Spark jobs on HDFS. We currently run Spark on Mesos, with some Mesos agents collocated with HDFS datanodes, and a good portion of our Spark jobs process data stored in HDFS. We use Mesos agent attributes to make sure Spark executors are launched only on those agents (e.g. something like --conf spark.mesos.constraints=group:hdfs).

We could do something similar here, e.g. add some code that lets the user specify a custom nodeSelector for the executor pods, e.g. --conf spark.kubernetes.executor.nodeSelectors=group:hdfs (sketched below).
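
To illustrate the idea (spark.kubernetes.executor.nodeSelectors is the name proposed above, not an existing option), the flag could render into a standard nodeSelector on each executor pod spec:

$ bin/spark-submit ...  \
    --conf spark.kubernetes.executor.nodeSelectors=group:hdfs ...

# resulting fragment of each executor pod spec:
spec:
  nodeSelector:
    group: hdfs    # schedule only onto nodes labeled group=hdfs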

@ash211 changed the title from "Support & test remote HDFS access" to "HDFS access umbrella issue" Mar 17, 2017
@ash211 commented Mar 17, 2017

Some preliminary work for basic HDFS support is happening in #130, which sends the contents of the HADOOP_CONF_DIR and YARN_CONF_DIR directories from the submitter to the driver for use when accessing HDFS.

We expect this to cover two use cases I've seen in testing so far:

  • allow reading from hdfs:// URIs that refer to an HDFS cluster by nameservice rather than by NameNode FQDN, which enables reading from an HA NameNode cluster
  • allow writing to nameservice-based URIs as well, e.g. for spark.eventLog.dir values on an HDFS cluster (see the illustrative config after this list)
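
For context, nameservice-based URIs resolve through the client-side HA entries in the Hadoop config shipped from HADOOP_CONF_DIR. A minimal sketch of such entries (the mycluster nameservice and hostnames are placeholders, not values from this thread):

<!-- core-site.xml -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://mycluster</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

The submitter would typically just point HADOOP_CONF_DIR at a directory containing these files before running spark-submit:

$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ bin/spark-submit ...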

Obviously there's a lot more work to be done, but with these basics in place hopefully the group will be unblocked for richer interactions with HDFS clusters, both in-kube and outside.

@kimoonkim

Kerberos support is being addressed in #379 and #391.
