ruichao-factual edited this page Nov 25, 2013 · 5 revisions

This tutorial is a work in progress. If there's a specific topic you'd like covered, please let us know via the Google Group for Drake.

Overview

Your Drake workflow file specifies what steps you want to run. Generally speaking, each step relies on one or more input sources, and is expected to create one or more output artifacts.

A Drake workflow file is organized primarily by step. In addition to specifying inputs and outputs, a step will generally contain explicit commands for that step, and possibly extra options.

Here's an example of a single step in a Drake workflow file:

; we only like lines with lowercase "i" in them
out.csv <- in.csv [shell]
  grep i $INPUT > $OUTPUT

The above step uses Drake's "shell" protocol, meaning the commands are shell commands. (There are other protocols available, which must be specified explicitly. But for this tutorial, we'll focus on using the shell protocol.)

Let's break down the specific elements of the above step:

  • out.csv: the output file to produce
  • in.csv: the input file to use
  • [shell]: the brackets hold the options for the step. A very important option is the step protocol. In this case we're choosing the "shell" protocol, which allows us to run shell commands in this step.
  • the indented line: indented lines following the first line of the step are the commands of the step. In this case, there's exactly one command, which performs line filtering. Note that the command is a shell command, per our use of the shell protocol.
  • $INPUT: A Drake shell step automatically loads the shell environment variables with useful information before running the step's shell command(s). For example, it loads the INPUT environment variable with the file path of the first input specified by the step. Therefore, the step's shell commands have access to variables such as $INPUT.
  • $OUTPUT: Similar to $INPUT, a Drake shell step automatically loads the OUTPUT environment variable with the file path of the first output specified by the step, before running the step's shell commands.
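To make the role of $INPUT and $OUTPUT concrete, here's a rough sketch that does by hand what the bullets above describe: set the two variables, then run the step's command. This is not Drake itself, just an emulation in plain shell. The temporary directory and the two sample lines are assumptions made for illustration.

```shell
# Emulate by hand what the text says a Drake shell step does:
# load INPUT and OUTPUT into the environment, then run the command.
# (Illustration only; Drake is not involved here.)
workdir=$(mktemp -d)   # throwaway directory, chosen for this sketch
cd "$workdir"
printf 'Alvin,Chyan,alvin\nAaron,Crow,aaron\n' > in.csv

export INPUT=in.csv    # path of the step's first input
export OUTPUT=out.csv  # path of the step's first output

# The step's single command: keep only lines containing a lowercase "i"
grep i "$INPUT" > "$OUTPUT"
cat "$OUTPUT"
```

Only the Alvin line should end up in out.csv, since the Aaron line has no lowercase "i" in it.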

Basic dependency management

A Drake workflow may have many steps, which might depend on each other in various ways. As a simple example, consider this additional step we could add to our workflow file:

; produce an extraordinarily fancy report
count.txt <- out.csv
  wc $INPUT > $OUTPUT

This step depends on out.csv (that is, it uses out.csv as its input file), and produces count.txt. Because of the dependence on out.csv, Drake will by default make sure out.csv is up-to-date. This means Drake will run the step(s) required to create out.csv if necessary. (This behaviour is a tenet of basic dependency management that we've come to know and love through tools like Make.)
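The ordering described above can be sketched in plain shell. This is a simplification, not how Drake is implemented: the "does out.csv exist?" check is only a stand-in for Drake's real up-to-date logic, which this tutorial doesn't detail, and the temporary directory and sample data are assumptions for illustration.

```shell
# Hand-rolled emulation of the two-step dependency chain.
# (Sketch only; Drake's actual up-to-date check is not shown on this page.)
workdir=$(mktemp -d)   # throwaway directory, chosen for this sketch
cd "$workdir"
printf 'Alvin,Chyan,alvin\nAaron,Crow,aaron\nWill,Lao,will\n' > in.csv

# Step 1: produce out.csv first, because step 2 depends on it.
# The existence test stands in for "is out.csv up-to-date?".
if [ ! -e out.csv ]; then
  grep i in.csv > out.csv
fi

# Step 2: out.csv is now available, so the report can be built.
wc out.csv > count.txt
cat count.txt
```

The point is only the order: the step producing out.csv has to run (or be known to be current) before the step that consumes it.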

Drake's command line interface lets us specify which step we want to start with, along with various other target selection options. By default, however, Drake will attempt to run all the steps in your workflow.

For more details on Drake command line options, including target selection, please see the full user manual.

But we're getting ahead of ourselves. Let's learn by doing...

Your very first workflow

Drake is built to run data workflows. By default, it looks for your workflow file at ./Drakefile. This is why Drake will complain that it can't find your workflow file if you run it from a directory that has no ./Drakefile.

Let's start with a fresh workflow, in a new directory:

$ mkdir /myworkflow
$ cd /myworkflow

Now create a simple workflow to play with. Create a file named workflow.d and put this in it (stolen from the earlier example):

; we only like lines with lowercase "i" in them
out.csv <- in.csv
  grep i $INPUT > $OUTPUT

That's a very simple Drake workflow, with exactly one step. The step runs a single shell command, using in.csv as the input file and writing the output to out.csv.

We don't have an input file yet, so let's create it. Create a file named in.csv and put some CSV lines in it, like so:

Artem,Boytsov,artem
Aaron,Crow,aaron
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will
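If you'd rather create both files from the command line, something like the following should do it. Treat it as a sketch: it works in a throwaway directory from mktemp (an assumption made here so nothing existing gets overwritten), and it only writes the files without running Drake.

```shell
# Create the workflow file and the input file with here-documents.
# (Sketch only; uses a temporary directory rather than /myworkflow.)
workdir=$(mktemp -d)
cd "$workdir"

# Quoting 'EOF' keeps $INPUT and $OUTPUT literal in the workflow file,
# so Drake (not this shell) is the one to fill them in later.
cat > workflow.d <<'EOF'
; we only like lines with lowercase "i" in them
out.csv <- in.csv
  grep i $INPUT > $OUTPUT
EOF

cat > in.csv <<'EOF'
Artem,Boytsov,artem
Aaron,Crow,aaron
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will
EOF

# Quick sanity check: how many input lines contain a lowercase "i"?
grep -c i in.csv
```

The count at the end tells you how many lines to expect in out.csv once the workflow has run.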

Cool, now we have a Drake workflow and a simple input file on which to run the workflow. Let's run it! Since our workflow file isn't at the default ./Drakefile, we use -w to tell Drake where it is:

$ drake -w workflow.d

Let's check the output:

$ more out.csv
Alvin,Chyan,alvin
Maverick,Lou,maverick
Vinnie,Pepi,vinnie
Will,Lao,will