diff --git a/content/docs/command-reference/commit.md b/content/docs/command-reference/commit.md index f70c3388e2..100052637b 100644 --- a/content/docs/command-reference/commit.md +++ b/content/docs/command-reference/commit.md @@ -20,10 +20,10 @@ positional arguments: The `dvc commit` command is useful for several scenarios, when data already tracked by DVC changes: when a [stage](/doc/command-reference/run) or -[pipeline](/doc/command-reference/pipeline) is in development/experimentation; -when manually editing or generating DVC outputs; or to force -DVC-file updates without reproducing stages or pipelines. These scenarios are -further detailed below. +[pipeline](/doc/command-reference/dag) is in development/experimentation; when +manually editing or generating DVC outputs; or to force DVC-file +updates without reproducing stages or pipelines. These scenarios are further +detailed below. - Code or data for a stage is under active development, with multiple iterations (experiments) in code, configuration, or data. Use the `--no-commit` option of diff --git a/content/docs/command-reference/fetch.md b/content/docs/command-reference/fetch.md index aa79455b2b..8eeda06825 100644 --- a/content/docs/command-reference/fetch.md +++ b/content/docs/command-reference/fetch.md @@ -49,7 +49,7 @@ on DVC remotes.) These necessary data or model files are listed as dependencies or outputs in a target [stage](/doc/command-reference/run) (in `dvc.yaml`) or `.dvc` file, so they are required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the -corresponding [pipeline](/doc/command-reference/pipeline). +corresponding [pipeline](/doc/command-reference/dag). `dvc fetch` ensures that the files needed for a stage or `.dvc` file to be [reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in @@ -136,9 +136,6 @@ $ cd example-get-started -The workspace looks almost like in this -[pipeline setup](/doc/tutorials/pipelines): - ```dvc . 
├── data diff --git a/content/docs/command-reference/freeze.md b/content/docs/command-reference/freeze.md index 0ab6b75fc5..b0321cfc79 100644 --- a/content/docs/command-reference/freeze.md +++ b/content/docs/command-reference/freeze.md @@ -20,7 +20,7 @@ reproduction will not regenerate outputs of frozen stages, even if their dependencies have changed, and even if `--force` is used. Freezing a stage is useful to avoid syncing data from the top of its -[pipeline](/doc/command-reference/pipeline), and keep iterating on the last +[pipeline](/doc/command-reference/dag), and keep iterating on the last (non-frozen) stages only. Note that import stages are frozen by default. Use `dvc update` to diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index bcadb7c6cf..d3d84f0322 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -194,8 +194,8 @@ trying this example (especially if trying out the following one). What if that remote file is updated regularly? The project goals might include regenerating some results based on the updated data source. -[Pipeline](/doc/command-reference/pipeline) reproduction can be triggered based -on a changed external dependency. +[Pipeline](/doc/command-reference/dag) reproduction can be triggered based on a +changed external dependency. Let's use the [Get Started](/doc/tutorials/get-started) project again, simulating an updated external data source. (Remember to prepare the diff --git a/content/docs/command-reference/index.md b/content/docs/command-reference/index.md index f520bf9248..9e813665d0 100644 --- a/content/docs/command-reference/index.md +++ b/content/docs/command-reference/index.md @@ -10,7 +10,7 @@ DVC is a command line tool. The typical DVC workflow goes as follows: `dvc run` command, along with its `--outs` option for outputs that should also be tracked by DVC after the code is executed. 
- Sharing a Git repository with the source code of your ML - [pipeline](/doc/command-reference/pipeline) will not include the project's + [pipeline](/doc/command-reference/dag) will not include the project's cache. Use [remote storage](/doc/command-reference/remote) and `dvc push` to share this cache (data tracked by DVC). - Use `dvc repro` to automatically reproduce your full pipeline, iteratively as diff --git a/content/docs/command-reference/install.md b/content/docs/command-reference/install.md index d04a692973..d0f1f942b8 100644 --- a/content/docs/command-reference/install.md +++ b/content/docs/command-reference/install.md @@ -33,7 +33,7 @@ This hook automates `dvc checkout` after `git checkout`. **Commit/Reproduce**: Before committing DVC changes with Git, it may be necessary using `dvc commit` to store new data files not yet in cache. Or the changes might require reproducing the corresponding -[pipeline](/doc/command-reference/pipeline) (with `dvc repro`) to regenerate the +[pipeline](/doc/command-reference/dag) (with `dvc repro`) to regenerate the project's results (which implicitly commits them to DVC as well). This hook automates `dvc status` before `git commit` when needed, to remind the diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 747ba667e5..c05893072c 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -49,8 +49,8 @@ used to see what files `dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated with those stages or `.dvc` files. Using the `--with-deps` option, DVC tracks dependencies backward from the target [stage files](/doc/command-reference/run), -through the corresponding [pipelines](/doc/command-reference/pipeline), to find -data files to pull. +through the corresponding [pipelines](/doc/command-reference/dag), to find data +files to pull. 
After a data file is in cache, `dvc pull` can use OS-specific mechanisms like reflinks or hardlinks to put it in the workspace without copying. See @@ -129,9 +129,6 @@ $ cd example-get-started -The workspace looks almost like in this -[pipeline setup](/doc/tutorials/pipelines): - ```dvc . ├── data @@ -167,17 +164,9 @@ $ dvc pull train.dvc > Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to > follow this example if you tried the previous ones. -Our [pipeline](/doc/command-reference/pipeline) has been setup with these -[stages](/doc/command-reference/run): - -```dvc -$ dvc pipeline show evaluate.dvc -data/data.xml.dvc -prepare.dvc -featurize.dvc -train.dvc -evaluate.dvc -``` +Our [pipeline](/doc/command-reference/dag) has been set up with these +[stages](/doc/command-reference/run): `prepare`, `featurize`, `train`, +`evaluate`. Imagine the [remote storage](/doc/command-reference/remote) has been modified such that the data in some of these stages should be updated in the @@ -195,7 +184,7 @@ One could do a simple `dvc pull` to get all the data, but what if you only want to retrieve part of the data? ```dvc -$ dvc pull --with-deps featurize.dvc +$ dvc pull --with-deps featurize ... Use the partial update, then pull the remaining data: diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index e5b1aacf15..0617f8b5ad 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -67,7 +67,7 @@ cache (compared to the default remote.) It can be used to see what files If one or more `targets` are specified, DVC only considers the files associated with them. Using the `--with-deps` option, DVC tracks dependencies backward from the target [stage files](/doc/command-reference/run), through the corresponding -[pipelines](/doc/command-reference/pipeline), to find data files to push. +[pipelines](/doc/command-reference/dag), to find data files to push.
## Options @@ -151,27 +151,16 @@ $ dvc push data.zip.dvc ## Example: With dependencies Demonstrating the `--with-deps` option requires a larger example. First, assume -a [pipeline](/doc/command-reference/pipeline) has been setup with these -[stages](/doc/command-reference/run): +a [pipeline](/doc/command-reference/dag) has been set up with these +[stages](/doc/command-reference/run): `clean-posts`, `featurize`, `test-posts`, +`matrix-train`. -```dvc -$ dvc pipeline show -data/Posts.xml.zip.dvc -Posts.xml.dvc -Posts.tsv.dvc -Posts-test.tsv.dvc -matrix-train.p.dvc -model.p.dvc -Dvcfile -``` - -Imagine the projects has been modified such that the +Imagine the project has been modified such that the outputs of some of these stages need to be uploaded to [remote storage](/doc/command-reference/remote). ```dvc $ dvc status --cloud - new: data/model.p new: data/matrix-test.p new: data/matrix-train.p @@ -190,7 +179,6 @@ $ dvc push --with-deps model.p.dvc ... Push the rest of the data $ dvc status --cloud - Data and pipelines are up to date. ``` diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index 8245f80434..862b3dace5 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -1,6 +1,6 @@ # repro -Reproduce complete or partial [pipelines](/doc/command-reference/pipeline) by +Reproduce complete or partial [pipelines](/doc/command-reference/dag) by executing commands defined in their [stages](/doc/command-reference/run) in the correct order. The commands to be executed are determined by recursively analyzing dependencies and outputs of the target stages. @@ -166,7 +166,7 @@ only execute the final stage. For simplicity, let's build a pipeline defined below. (If you want get your hands-on something more real, see this short -[pipeline tutorial](/doc/tutorials/pipelines)). It takes this `text.txt` file: +[pipeline tutorial](/doc/start/data-pipelines)).
It takes this `text.txt` file: ``` dvc diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index c25e1e1031..d45892f262 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -22,8 +22,8 @@ positional arguments: ## Description `dvc run` is a helper for creating or updating -[pipeline](/doc/command-reference/pipeline) stages in a `dvc.yaml` file (located -in the current working directory). _Stages_ represent individual data processes, +[pipeline](/doc/command-reference/dag) stages in a `dvc.yaml` file (located in +the current working directory). _Stages_ represent individual data processes, including their input and resulting outputs. A stage name is required and can be provided using the `-n` (`--name`) option. @@ -112,8 +112,8 @@ run directly, for example a shell built-in, expression, or binary found in by the command itself, not by `dvc run`. ⚠️ Note that while DVC is platform-agnostic, the commands defined in your -[pipeline](/doc/command-reference/pipeline) stages may only work on some -operating systems and require certain software packages to be installed. +[pipeline](/doc/command-reference/dag) stages may only work on some operating +systems and require certain software packages to be installed. Wrap the command with double quotes `"` if there are special characters in it like `|` (pipe) or `<`, `>` (redirection), otherwise they would apply to @@ -330,8 +330,8 @@ $ tree .. ## Example: Chaining stages -DVC [pipelines](/doc/command-reference/pipeline) are constructed by connecting -the outputs of a stage to the dependencies of the following one(s). +DVC [pipelines](/doc/command-reference/dag) are constructed by connecting the +outputs of a stage to the dependencies of the following one(s). 
Extract an XML file from an archive to the `data/` folder: diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 5dddb25835..7fc408869c 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -1,7 +1,7 @@ # status Show changes in the project -[pipelines](/doc/command-reference/pipeline), as well as file mismatches either +[pipelines](/doc/command-reference/dag), as well as file mismatches either between the cache and workspace, or between the cache and remote storage. diff --git a/content/docs/use-cases/shared-development-server.md b/content/docs/use-cases/shared-development-server.md index 115c32cce3..b42c60770a 100644 --- a/content/docs/use-cases/shared-development-server.md +++ b/content/docs/use-cases/shared-development-server.md @@ -91,7 +91,7 @@ Your colleagues can [checkout](/doc/command-reference/checkout) the project data (from the shared cache), and have both `raw` and `clean` data files appear in their workspace without moving anything manually. After this, they could decide to continue building this -[pipeline](/doc/command-reference/pipeline) and process the clean data: +[pipeline](/doc/command-reference/dag) and process the clean data: ```dvc $ git pull diff --git a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md index eab8bba94b..c78727ca7a 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/tutorial.md +++ b/content/docs/use-cases/versioning-data-and-model-files/tutorial.md @@ -354,14 +354,14 @@ a monolithic way. It uses the `save_bottleneck_feature` function to pre-calculate the bottom, "frozen" part of the net every time it is run. Features are written into files. 
The intention was probably that the `save_bottleneck_feature` can be commented out after the first run, but it's not -very convenient having to remember to do si it every time the dataset changes. - -Here's where the [pipelines](/doc/command-reference/pipeline) feature of DVC -comes in handy. We touched on it briefly when we described `dvc run` and -`dvc repro`. The next step would be splitting the script into two parts and -utilizing pipelines. See [this example](/doc/tutorials/pipelines) to get -hands-on experience with pipelines, and try to apply it here. Don't hesitate to -join our [community](/chat) and ask any questions! +very convenient having to remember to do so every time the dataset changes. + +Here's where the [pipelines](/doc/command-reference/dag) feature of DVC comes in +handy. We touched on it briefly when we described `dvc run` and `dvc repro`. The +next step would be splitting the script into two parts and utilizing pipelines. +See [Data Pipelines](/doc/start/data-pipelines) to get hands-on experience with +pipelines, and try to apply it here. Don't hesitate to join our +[community](/chat) and ask any questions! Another detail we only brushed upon here is the way we captured the `metrics.csv` metric file with the `-M` option of `dvc run`. Marking this diff --git a/content/docs/user-guide/dvcignore.md b/content/docs/user-guide/dvcignore.md index 49d29a5fee..eceef9dff1 100644 --- a/content/docs/user-guide/dvcignore.md +++ b/content/docs/user-guide/dvcignore.md @@ -29,8 +29,8 @@ DVC-handled directories. **It is crucial to understand, that DVC might remove ignored files upon `dvc run` or `dvc repro`. 
If they are not produced by a -[pipeline](/doc/command-reference/pipeline) [stage](/doc/command-reference/run), -they can be deleted permanently.** +[pipeline](/doc/command-reference/dag) [stage](/doc/command-reference/run), they +can be deleted permanently.** Keep in mind, that when you add to `.dvcignore` entries that affect one of the existing outputs, its status will change and DVC will behave as if diff --git a/content/docs/user-guide/what-is-dvc/core-features.md b/content/docs/user-guide/what-is-dvc/core-features.md index 44dee4ad2e..5960a1feed 100644 --- a/content/docs/user-guide/what-is-dvc/core-features.md +++ b/content/docs/user-guide/what-is-dvc/core-features.md @@ -4,7 +4,7 @@ interface and Git workflow. - It makes data science projects **reproducible** by creating lightweight - [pipelines](/doc/command-reference/pipeline) using implicit dependency graphs. + [pipelines](/doc/command-reference/dag) using implicit dependency graphs. - **Large data file versioning** works by creating special files in your Git repository that point to the cache, typically stored on a local diff --git a/content/docs/user-guide/what-is-dvc/index.md b/content/docs/user-guide/what-is-dvc/index.md index 6755f75520..752dbfbe77 100644 --- a/content/docs/user-guide/what-is-dvc/index.md +++ b/content/docs/user-guide/what-is-dvc/index.md @@ -43,7 +43,7 @@ DVC uses a few core concepts: (dependencies) and outputs. Pipelines are defined by special [stage files](/doc/command-reference/run) (similar to [Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)). - Refer to [pipeline](/doc/command-reference/pipeline) for more information. + Refer to [pipeline](/doc/command-reference/dag) for more information. - **Workflow**: Set of experiments and relationships among them. Workflow corresponds to the entire Git repository.