dag: explanation of what pipelines are #1953

Closed · wants to merge 1 commit
11 changes: 7 additions & 4 deletions content/docs/command-reference/dag.md
```diff
@@ -15,14 +15,17 @@ positional arguments:
 
 ## Description
 
-A data pipeline, in general, is a series of data processing
+Data science and machine learning pipelines are used to help automate the
+workflow. A data pipeline, in general, is a series of data processing
 [stages](/doc/command-reference/run) (for example, console commands that take an
 input and produce an <abbr>output</abbr>). A pipeline may produce intermediate
 data, and has a final result.
 
-Data science and machine learning pipelines typically start with large raw
-datasets, include intermediate featurization and training stages, and produce a
-final model, as well as accuracy [metrics](/doc/command-reference/metrics).
+Pipelines typically start with large raw datasets, include intermediate data
+pre-processing, featurization, and training stages, and produce a final model,
+as well as accuracy [metrics](/doc/command-reference/metrics). A pipeline makes
+it easy to iterate over the workflow, as it doesn't require executing each
+stage individually.
 
 In DVC, pipeline stages and commands, their data I/O, interdependencies, and
 results (intermediate or final) are specified in `dvc.yaml`, which can be
```
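The paragraphs changed in this diff describe pipelines abstractly; a minimal `dvc.yaml` sketch can make the stage/dependency idea concrete. The script and data names below (`prepare.py`, `train.py`, `data/raw.csv`, `metrics.json`) are hypothetical placeholders, but the `stages`, `cmd`, `deps`, `outs`, and `metrics` keys follow the `dvc.yaml` format that the description references.

```yaml
# Hypothetical two-stage pipeline: pre-processing feeds training.
stages:
  prepare:
    cmd: python prepare.py data/raw.csv   # data pre-processing stage
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared                     # intermediate data
  train:
    cmd: python train.py data/prepared    # training stage
    deps:
      - train.py
      - data/prepared                     # depends on prepare's output
    outs:
      - model.pkl                         # final result
    metrics:
      - metrics.json:                     # accuracy metrics
          cache: false
```

With a pipeline defined like this, `dvc repro` re-executes only the stages whose dependencies changed, which is what makes iterating over the workflow cheap, and `dvc dag` visualizes the resulting stage graph.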