job-level settings not being passed on to jobs #163

Open
echeran opened this issue Jul 18, 2013 · 1 comment

echeran commented Jul 18, 2013

(Originally posted to the mailing list: https://groups.google.com/forum/#!topic/cascalog-user/Rq_O33VsDyc )

I've run into a similar issue: child-JVM options specified in with-job-conf don't "stick". Last week I hit GC problems in a reducer of one of my Cascalog jobs for the first time. I found the with-job-conf macro and wrapped the query-execution form with it, to no avail:

(let [snk-qry-by-chan (for [chan channels]
                        (channel-query chan))
      all-snk-qry-seq (apply concat snk-qry-by-chan)]
  ;; configure the MapReduce child JVM options to avoid the GC Overhead Limit error
  (with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
    ;; execute all of the queries in parallel
    (apply ?- all-snk-qry-seq)))
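
For what it's worth, the binding itself appears correct at submission time. A minimal sanity-check sketch, assuming Cascalog 1.10.x still keeps the merged settings map in the cascalog.conf/*JOB-CONF* dynamic var that with-job-conf rebinds:

    (require '[cascalog.api :refer [with-job-conf]]
             '[cascalog.conf :as conf])

    ;; with-job-conf merges its map into cascalog.conf/*JOB-CONF*, which Cascalog
    ;; consults when building the Cascading flow; printing it confirms the setting
    ;; is present at query-submission time, even though it never shows up in the
    ;; task logs.
    (with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"}
      (println conf/*JOB-CONF*))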

The relevant parts of my project.clj:

  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]
                 [incanter "1.4.1"]]
  :repositories {"cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos"}
  :profiles {:provided {:dependencies [[org.apache.hadoop/hadoop-core  "0.20.2-cdh3u5"]]}}

But in the logging output from the reducer in question, regardless of what I specified in with-job-conf, I always saw this:

2013-07-12 17:25:55,216 INFO cascading.flow.hadoop.FlowMapper: child jvm opts: -Xmx1073741824

Further details:

  • We're running Cloudera's distribution of Hadoop (CDH 4.1.4), which is based on Hadoop 2.0.0.
  • I'm running Cascalog in cluster mode (I uberjar the code whenever I deploy).
  • The exception thrown by the JVM is a GC overhead limit exceeded error (technically an OutOfMemoryError whose message is "GC overhead limit exceeded", as opposed to a plain "Java heap space" OutOfMemoryError).
  • (New detail as of 7/18/13) I've noticed that with-job-conf does pass through at least some other jobconf settings. The one clear example I've seen: my with-job-conf map had the key "io.compression.codecs" with a value containing "com.hadoop.compression.lzo.LzopCodec", which does not exist on our installation, and I got an error, so that setting evidently did reach the job (a minimal repro sketch follows this list).
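
A minimal repro sketch of that asymmetry, with both keys in the same with-job-conf map (the paths are hypothetical placeholders):

    (require '[cascalog.api :refer :all])

    ;; Both settings travel in the same map. The codec setting demonstrably
    ;; reaches the job (it errors on our LZO-less cluster), while the child
    ;; JVM opts apparently do not.
    (with-job-conf {"mapred.child.java.opts" "-XX:-UseGCOverheadLimit -Xmx4g"
                    "io.compression.codecs"  "com.hadoop.compression.lzo.LzopCodec"}
      (?- (hfs-textline "/tmp/repro-out" :sinkmode :replace)
          (<- [?line]
              ((hfs-textline "/tmp/repro-in") ?line))))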

I saw Robin's workaround, which seems to just modify the Hadoop site config (hadoop-site.xml). It would be great if the with-job-conf settings "stuck", so that per-job needs didn't require tweaking site-wide settings (especially since I don't manage the Hadoop cluster).
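
For reference, that workaround amounts to baking the option into the cluster-wide site config, presumably something along these lines (hypothetical snippet; the exact file depends on the install):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-XX:-UseGCOverheadLimit -Xmx4g</value>
    </property>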

mjwillson commented

I've noticed (perhaps?) related issues in pure Cascading. Configuration properties supplied to the FlowConnector don't always get passed into the JobConf; the behaviour seems inconsistent and unpredictable. It would be good to have visibility into, and explicit, guaranteed control over, the JobConf.
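
For concreteness, by "supplied to the FlowConnector" I mean something like this (Clojure interop sketch; class names are from Cascading 2.x, which Cascalog 1.10 builds on):

    (import '[java.util Properties]
            '[cascading.flow.hadoop HadoopFlowConnector])

    ;; Properties handed to the FlowConnector are meant to seed the JobConf of
    ;; every job in the resulting flow, but in practice some keys don't seem
    ;; to make it through.
    (let [props (doto (Properties.)
                  (.setProperty "mapred.child.java.opts" "-Xmx4g"))]
      (HadoopFlowConnector. props))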
