
Sometimes you want to pass arguments to a job, for example as command-line arguments to the hadoop jar command, and you want that data to be available in the workers, that is, in the mappers and reducers that actually process the job.

This is sometimes called side data. One could also call it a parameterized job.

There are multiple ways to do this. One common way is to put the data in the JobConf, but the JobConf is not straightforward to use from Cascalog. Another way is to use the distributed cache.

Thanks to late compilation, Cascalog offers another, easy way: let the left-arrow form (<- [...] ...) close over the side data. For this to work, the side data must be serializable; a string, for example, is fine. The example at the end of this page shows the technique in full.

It is important to understand that there is no point in putting the side data in a dynamic var, or in a var that holds an atom: the workers will only ever see the root binding of the var, or the initial value of the atom. That is because the workers only run the code that is (called from) "inside" the left-arrow form; they never run the (binding ...) or (reset! ...) that happens "outside" or "before" the left-arrow form.
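To make the failure mode concrete, here is a minimal sketch of the anti-pattern. The names *prefix*, add-prefix and broken-query, and the taps, are made up for this example:

(def ^:dynamic *prefix* nil)

(defn add-prefix [s]
  ;; Reads *prefix* when a worker calls it. On the worker, only the
  ;; root binding (nil) is visible, so the prefix is silently lost.
  (str *prefix* s))

(defn broken-query [in-tap out-tap]
  ;; This binding happens on the machine that submits the job,
  ;; never on the workers.
  (binding [*prefix* "foo"]
    (?- out-tap
        (<- [?prefixed]
            (in-tap :> ?s)
            (add-prefix ?s :> ?prefixed)))))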

To add to the confusion, in local mode, when all your taps are in-memory vectors or lfs-textline taps, putting side data in vars may actually appear to work, because everything runs inside a single JVM.

An example that shows the closure approach in action:

(ns prefixer.core  ; the namespace name is arbitrary
  (:use cascalog.api)
  (:gen-class))

;; The let-bound prefix is closed over by the <- form and shipped to
;; the workers as part of the serialized query.
(defn -main [& args]
  (let [prefix (first args)]
    (?- (hfs-textline "hdfs://user/prefixed-strings.txt")
        (<- [?prefixed-string]
            ((hfs-textline "hdfs://user/unprefixed-strings.txt")
              :> ?unprefixed-string)
            (str prefix ?unprefixed-string :> ?prefixed-string)))))
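If the namespace is AOT-compiled and set as the jar's main class, the job could then be launched along these lines (the jar name is made up for this example):

hadoop jar prefixer-standalone.jar foo

Every mapper then sees foo as the prefix through the closure, with no JobConf or distributed-cache plumbing needed.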