pipeline: links and examples update for 1.x #1584

Merged
merged 19 commits into from
Jul 29, 2020
Changes from 14 commits
8 changes: 4 additions & 4 deletions content/docs/command-reference/commit.md
@@ -20,10 +20,10 @@ positional arguments:

The `dvc commit` command is useful for several scenarios, when data already
tracked by DVC changes: when a [stage](/doc/command-reference/run) or
[pipeline](/doc/command-reference/pipeline) is in development/experimentation;
when manually editing or generating DVC <abbr>outputs</abbr>; or to force
DVC-file updates without reproducing stages or pipelines. These scenarios are
further detailed below.
[pipeline](/doc/command-reference/dag) is in development/experimentation; when
manually editing or generating DVC <abbr>outputs</abbr>; or to force DVC-file
updates without reproducing stages or pipelines. These scenarios are further
detailed below.

- Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the `--no-commit` option of
5 changes: 1 addition & 4 deletions content/docs/command-reference/fetch.md
@@ -49,7 +49,7 @@ on DVC remotes.) These necessary data or model files are listed as
<abbr>dependencies</abbr> or <abbr>outputs</abbr> in a target
[stage](/doc/command-reference/run) (in `dvc.yaml`) or `.dvc` file, so they are
required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline).
corresponding [pipeline](/doc/command-reference/dag).

`dvc fetch` ensures that the files needed for a stage or `.dvc` file to be
[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in
@@ -136,9 +136,6 @@ $ cd example-get-started

</details>

The workspace looks almost like in this
[pipeline setup](/doc/tutorials/pipelines):

```dvc
.
├── data
2 changes: 1 addition & 1 deletion content/docs/command-reference/freeze.md
@@ -20,7 +20,7 @@ reproduction will not regenerate <abbr>outputs</abbr> of frozen stages, even if
their <abbr>dependencies</abbr> have changed, and even if `--force` is used.

Freezing a stage is useful to avoid syncing data from the top of its
[pipeline](/doc/command-reference/pipeline), and keep iterating on the last
[pipeline](/doc/command-reference/dag), and keep iterating on the last
(non-frozen) stages only.

Note that <abbr>import stages</abbr> are frozen by default. Use `dvc update` to
4 changes: 2 additions & 2 deletions content/docs/command-reference/import-url.md
@@ -194,8 +194,8 @@ trying this example (especially if trying out the following one).

What if that remote file is updated regularly? The project goals might include
regenerating some results based on the updated data source.
[Pipeline](/doc/command-reference/pipeline) reproduction can be triggered based
on a changed external dependency.
[Pipeline](/doc/command-reference/dag) reproduction can be triggered based on a
changed external dependency.

Let's use the [Get Started](/doc/tutorials/get-started) project again,
simulating an updated external data source. (Remember to prepare the
2 changes: 1 addition & 1 deletion content/docs/command-reference/index.md
@@ -10,7 +10,7 @@ DVC is a command line tool. The typical DVC workflow goes as follows:
`dvc run` command, along with its `--outs` option for <abbr>outputs</abbr>
that should also be tracked by DVC after the code is executed.
- Sharing a Git repository with the source code of your ML
[pipeline](/doc/command-reference/pipeline) will not include the project's
[pipeline](/doc/command-reference/dag) will not include the project's
<abbr>cache</abbr>. Use [remote storage](/doc/command-reference/remote) and
`dvc push` to share this cache (data tracked by DVC).
- Use `dvc repro` to automatically reproduce your full pipeline, iteratively as
2 changes: 1 addition & 1 deletion content/docs/command-reference/install.md
@@ -33,7 +33,7 @@ This hook automates `dvc checkout` after `git checkout`.
**Commit/Reproduce**: Before committing DVC changes with Git, it may be
necessary using `dvc commit` to store new data files not yet in cache. Or the
changes might require reproducing the corresponding
[pipeline](/doc/command-reference/pipeline) (with `dvc repro`) to regenerate the
[pipeline](/doc/command-reference/dag) (with `dvc repro`) to regenerate the
project's results (which implicitly commits them to DVC as well).

This hook automates `dvc status` before `git commit` when needed, to remind the
43 changes: 31 additions & 12 deletions content/docs/command-reference/pull.md
@@ -49,8 +49,8 @@ used to see what files `dvc pull` would download.
If one or more `targets` are specified, DVC only considers the files associated
with those stages or `.dvc` files. Using the `--with-deps` option, DVC tracks
dependencies backward from the target [stage files](/doc/command-reference/run),
through the corresponding [pipelines](/doc/command-reference/pipeline), to find
data files to pull.
through the corresponding [pipelines](/doc/command-reference/dag), to find data
files to pull.

After a data file is in cache, `dvc pull` can use OS-specific mechanisms like
reflinks or hardlinks to put it in the workspace without copying. See
@@ -129,9 +129,6 @@ $ cd example-get-started

</details>

The workspace looks almost like in this
[pipeline setup](/doc/tutorials/pipelines):

```dvc
.
├── data
@@ -167,16 +164,38 @@ $ dvc pull train.dvc
> Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to
> follow this example if you tried the previous ones.

Our [pipeline](/doc/command-reference/pipeline) has been setup with these
Our [pipeline](/doc/command-reference/dag) has been setup with these
[stages](/doc/command-reference/run):

```dvc
$ dvc pipeline show evaluate.dvc
data/data.xml.dvc
prepare.dvc
featurize.dvc
train.dvc
evaluate.dvc
$ dvc dag evaluate
+-------------------+
| data/data.xml.dvc |
+-------------------+
*
*
*
+---------+
| prepare |
+---------+
*
*
*
+-----------+
| featurize |
+-----------+
** **
** *
* **
+-------+ *
| train | **
+-------+ *
** **
** **
* *
+----------+
| evaluate |
+----------+
```
Contributor Author

Changing the output here because it's a relatively less incorrect output than the previous one.

Unfortunately, the remote still isn't working. I cannot find the output of evaluate.dvc since it doesn't exist. Also, files in src/ are Git-tracked files, so I had to remove them from git. Even after adding all files (dvc add) in src/, dvc dag evaluate.py.dvc does not show the linked files either. I suppose I am missing something and having the remote changed would result in an accurate output. This output is the output I'm currently getting.

Contributor

Yeah that's totally wrong 😋 please read the cmd ref at https://dvc.org/doc/command-reference/dag and notice the https://github.com/iterative/example-get-started project has been updated for 1.x already. There are no .dvc stage files, it's all in dvc.yaml. Have you followed the new Get Started at https://dvc.org/doc/start?

Contributor

Also, you don't need to use a target in dvc dag; just the plain command should be enough... But it will print a diagram which may be too long. If it is too long, just use cat dvc.yaml here as well. Thanks

Contributor

@jorgeorpinel Jul 22, 2020

Never mind. cat dvc.yaml will be way too long here also. Let's try dvc dag with the right target please. Lmk if you need help... But do read those refs I shared, hopefully you'll get it from that. Thanks

Contributor Author

Alright, I've updated it with dvc dag evaluate, but it still gave the entire diagram as output,
because the deps files link all the way back to prepare and featurize.

Contributor

@jorgeorpinel Jul 24, 2020

OK. Yeah, I contradicted myself, sorry... You don't need to use a target in dvc dag. The plain command should show everything by default.

But I think we're gonna have to go with a simple list of stage names like in push, because this diagram is too huge.

The rest of the example will also need updating here, so it makes sense (similar to the changes in https://github.com/iterative/dvc.org/pull/1591/files). Please try to make sure the entire context makes sense and works. Thanks

This comment was marked as resolved.

Contributor Author

Oh okay, I got your point. But a problem arises as stated in #1584 (comment).

Contributor

I don't see the issue. It's just a matter of updating the rest of the example please.

Contributor

This is the other missing point but we can extract into a separate issue. Are you having trouble understanding what I meant here about updating the rest of the example @utkarshsingh99 ? Thanks
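
On why `dvc dag evaluate` still printed the whole diagram: the following is a minimal Python sketch of the idea, not DVC's implementation. The stage names follow the diagram in this thread; the file paths, the `STAGES` table, and the `upstream` helper are hypothetical and only illustrate how a target's dependencies lead back through earlier stages.

```python
# Hypothetical stage table: names follow the diagram above, paths are made up.
STAGES = {
    "prepare":   {"deps": ["data/data.xml"], "outs": ["data/prepared"]},
    "featurize": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "train":     {"deps": ["data/features"], "outs": ["model.pkl"]},
    "evaluate":  {"deps": ["model.pkl", "data/features"], "outs": ["scores.json"]},
}


def upstream(target, stages):
    """Return the target plus every stage it depends on, directly or indirectly."""
    # Map each output path to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    seen, stack = [], [target]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        # Follow each dependency back to the stage that produces it, if any.
        for dep in stages[name]["deps"]:
            if dep in producer:
                stack.append(producer[dep])
    return sorted(seen)


print(upstream("evaluate", STAGES))
```

Under these assumptions, `evaluate` pulls in every other stage, while `upstream("prepare", STAGES)` returns only `prepare`, which is consistent with the last stage of a chain showing the full graph.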


Imagine the [remote storage](/doc/command-reference/remote) has been modified
16 changes: 5 additions & 11 deletions content/docs/command-reference/push.md
@@ -67,7 +67,7 @@ cache (compared to the default remote.) It can be used to see what files
If one or more `targets` are specified, DVC only considers the files associated
with them. Using the `--with-deps` option, DVC tracks dependencies backward from
the target [stage files](/doc/command-reference/run), through the corresponding
[pipelines](/doc/command-reference/pipeline), to find data files to push.
[pipelines](/doc/command-reference/dag), to find data files to push.

## Options

@@ -151,21 +151,15 @@ $ dvc push data.zip.dvc
## Example: With dependencies

Demonstrating the `--with-deps` option requires a larger example. First, assume
a [pipeline](/doc/command-reference/pipeline) has been setup with these
a [pipeline](/doc/command-reference/dag) has been setup with these
[stages](/doc/command-reference/run):

```dvc
$ dvc pipeline show
data/Posts.xml.zip.dvc
Posts.xml.dvc
Posts.tsv.dvc
Posts-test.tsv.dvc
matrix-train.p.dvc
model.p.dvc
Dvcfile
test-posts
matrix-train
```
jorgeorpinel marked this conversation as resolved.

Imagine the <abbr>projects</abbr> has been modified such that the
Imagine the <abbr>project</abbr> has been modified such that the
<abbr>outputs</abbr> of some of these stages need to be uploaded to
[remote storage](/doc/command-reference/remote).

4 changes: 2 additions & 2 deletions content/docs/command-reference/repro.md
@@ -1,6 +1,6 @@
# repro

Reproduce complete or partial [pipelines](/doc/command-reference/pipeline) by
Reproduce complete or partial [pipelines](/doc/command-reference/dag) by
executing commands defined in their [stages](/doc/command-reference/run) in the
correct order. The commands to be executed are determined by recursively
analyzing dependencies and <abbr>outputs</abbr> of the target stages.
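
As an aside on the "correct order" in this description: a minimal Python sketch of ordering stages by their dependencies, offered as an illustration rather than DVC's actual code. The stage names and the `DEPENDS_ON` mapping are hypothetical; `graphlib` is part of the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical mapping: each stage -> the stages whose outputs it depends on.
DEPENDS_ON = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def run_order(graph):
    """Return stage names so that every stage comes after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


print(run_order(DEPENDS_ON))
```

With this particular graph there is only one valid order, so the list always starts with `prepare` and ends with `evaluate`.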
@@ -166,7 +166,7 @@ only execute the final stage.

For simplicity, let's build a pipeline defined below. (If you want get your
hands-on something more real, see this short
[pipeline tutorial](/doc/tutorials/pipelines)). It takes this `text.txt` file:
[pipeline tutorial](/doc/start/data-pipelines)). It takes this `text.txt` file:

```
dvc
12 changes: 6 additions & 6 deletions content/docs/command-reference/run.md
@@ -22,8 +22,8 @@ positional arguments:
## Description

`dvc run` is a helper for creating or updating
[pipeline](/doc/command-reference/pipeline) stages in a `dvc.yaml` file (located
in the current working directory). _Stages_ represent individual data processes,
[pipeline](/doc/command-reference/dag) stages in a `dvc.yaml` file (located in
the current working directory). _Stages_ represent individual data processes,
including their input and resulting outputs.

A stage name is required and can be provided using the `-n` (`--name`) option.
@@ -112,8 +112,8 @@ run directly, for example a shell built-in, expression, or binary found in
by the command itself, not by `dvc run`.

⚠️ Note that while DVC is platform-agnostic, the commands defined in your
[pipeline](/doc/command-reference/pipeline) stages may only work on some
operating systems and require certain software packages to be installed.
[pipeline](/doc/command-reference/dag) stages may only work on some operating
systems and require certain software packages to be installed.

Wrap the command with double quotes `"` if there are special characters in it
like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to
@@ -330,8 +330,8 @@ $ tree ..

## Example: Chaining stages

DVC [pipelines](/doc/command-reference/pipeline) are constructed by connecting
the outputs of a stage to the dependencies of the following one(s).
DVC [pipelines](/doc/command-reference/dag) are constructed by connecting the
outputs of a stage to the dependencies of the following one(s).

Extract an XML file from an archive to the `data/` folder:

2 changes: 1 addition & 1 deletion content/docs/command-reference/status.md
@@ -1,7 +1,7 @@
# status

Show changes in the <abbr>project</abbr>
[pipelines](/doc/command-reference/pipeline), as well as file mismatches either
[pipelines](/doc/command-reference/dag), as well as file mismatches either
between the <abbr>cache</abbr> and <abbr>workspace</abbr>, or between the cache
and remote storage.

2 changes: 1 addition & 1 deletion content/docs/use-cases/shared-development-server.md
@@ -91,7 +91,7 @@ Your colleagues can [checkout](/doc/command-reference/checkout) the
<abbr>project</abbr> data (from the shared <abbr>cache</abbr>), and have both
`raw` and `clean` data files appear in their workspace without moving anything
manually. After this, they could decide to continue building this
[pipeline](/doc/command-reference/pipeline) and process the clean data:
[pipeline](/doc/command-reference/dag) and process the clean data:

```dvc
$ git pull
@@ -354,14 +354,14 @@ a monolithic way. It uses the `save_bottleneck_feature` function to
pre-calculate the bottom, "frozen" part of the net every time it is run.
Features are written into files. The intention was probably that the
`save_bottleneck_feature` can be commented out after the first run, but it's not
very convenient having to remember to do si it every time the dataset changes.

Here's where the [pipelines](/doc/command-reference/pipeline) feature of DVC
comes in handy. We touched on it briefly when we described `dvc run` and
`dvc repro`. The next step would be splitting the script into two parts and
utilizing pipelines. See [this example](/doc/tutorials/pipelines) to get
hands-on experience with pipelines, and try to apply it here. Don't hesitate to
join our [community](/chat) and ask any questions!
very convenient having to remember to do so every time the dataset changes.

Here's where the [pipelines](/doc/command-reference/dag) feature of DVC comes in
handy. We touched on it briefly when we described `dvc run` and `dvc repro`. The
next step would be splitting the script into two parts and utilizing pipelines.
See [Data Pipelines](/doc/start/data-pipelines) to get hands-on experience with
pipelines, and try to apply it here. Don't hesitate to join our
[community](/chat) and ask any questions!

Another detail we only brushed upon here is the way we captured the
`metrics.csv` metric file with the `-M` option of `dvc run`. Marking this
4 changes: 2 additions & 2 deletions content/docs/user-guide/dvcignore.md
@@ -29,8 +29,8 @@ DVC-handled directories.

**It is crucial to understand, that DVC might remove ignored files upon
`dvc run` or `dvc repro`. If they are not produced by a
[pipeline](/doc/command-reference/pipeline) [stage](/doc/command-reference/run),
they can be deleted permanently.**
[pipeline](/doc/command-reference/dag) [stage](/doc/command-reference/run), they
can be deleted permanently.**

Keep in mind, that when you add to `.dvcignore` entries that affect one of the
existing <abbr>outputs</abbr>, its status will change and DVC will behave as if
2 changes: 1 addition & 1 deletion content/docs/user-guide/what-is-dvc/core-features.md
@@ -4,7 +4,7 @@
interface and Git workflow.

- It makes data science projects **reproducible** by creating lightweight
[pipelines](/doc/command-reference/pipeline) using implicit dependency graphs.
[pipelines](/doc/command-reference/dag) using implicit dependency graphs.

- **Large data file versioning** works by creating special files in your Git
repository that point to the <abbr>cache</abbr>, typically stored on a local
2 changes: 1 addition & 1 deletion content/docs/user-guide/what-is-dvc/index.md
@@ -43,7 +43,7 @@ DVC uses a few core concepts:
(<abbr>dependencies</abbr>) and <abbr>outputs</abbr>. Pipelines are defined by
special [stage files](/doc/command-reference/run) (similar to
[Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)).
Refer to [pipeline](/doc/command-reference/pipeline) for more information.
Refer to [pipeline](/doc/command-reference/dag) for more information.

- **Workflow**: Set of experiments and relationships among them. Workflow
corresponds to the entire Git repository.