Maverick Lou edited this page Jun 24, 2015 · 13 revisions

Drake provides HDFS support by allowing you to specify inputs and outputs like `hdfs://my/big_file.txt`.
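For example, a minimal workflow step that reads from HDFS might look like the sketch below (the paths are illustrative; `$INPUT` and `$OUTPUT` are Drake's step variables):

```
; Count the lines of an HDFS file and write the result locally.
/tmp/line_count.txt <- hdfs://my/big_file.txt
    hadoop fs -cat $INPUT | wc -l > $OUTPUT
```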

Drake's default package includes a standard hadoop-core client library. However, there's a fair chance that your Hadoop cluster requires you to run a different version of the Hadoop client. Therefore, to make a best attempt at out-of-the-box HDFS support, the drake script automatically looks for your local Hadoop client and prefers to use that.

Use `drake --hadoop-version` to check which Hadoop client version Drake found.

If Drake cannot find a local Hadoop client, it will fall back to your `HADOOP_CLASSPATH` environment variable and use that, if set.
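As a sketch, you can check what Drake resolved and, if needed, expose your cluster's client via `HADOOP_CLASSPATH` (the `hadoop classpath` helper ships with Hadoop and prints the jars of your local install):

```shell
# Print the Hadoop client version Drake will use
drake --hadoop-version

# If Drake picked the wrong client (or none), point it at yours and re-check
export HADOOP_CLASSPATH="$(hadoop classpath)"
drake --hadoop-version
```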

If Drake cannot find your local Hadoop client, your `HADOOP_CLASSPATH` is not set, and the client library that ships with Drake is not compatible with your Hadoop cluster, Drake will not be able to support HDFS for you. Any attempt to use HDFS in your Drake workflows will fail with errors such as this one:

```
ERROR java.io.IOException: Call to somehost/10.0.0.30:9000 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:775)
        at org.apache.hadoop.ipc.Client.call(Client.java:743)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:207)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:170)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:180)
        at drake.fs$hdfs_filesystem.invoke(fs.clj:144)
        at drake.fs.HDFS.exists_QMARK_(fs.clj:153)
```

If you're trying to get your Drake workflows to work with HDFS and you see errors like this, you should:

  1. Find out why Drake was not able to locate your local Hadoop client. Fixing that should restore Drake's HDFS support.
  2. If (1) does not yield results, modify `project.clj` to specify the exact Hadoop client library version you need, rather than the default, and then make your own build of Drake that is compatible with your Hadoop cluster.
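For instance, the dependency pin in `project.clj` might look like the fragment below (the version string is illustrative; substitute the one your cluster actually runs), after which building with Leiningen produces a Drake compatible with that client:

```clojure
;; project.clj (fragment): pin the Hadoop client library.
;; "1.2.1" is an example version; use your cluster's actual version.
:dependencies [[org.apache.hadoop/hadoop-core "1.2.1"]
               ;; leave Drake's other dependencies as-is
               ]
```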