guide: absorb What is DVC? into other existing docs, et al. #1581
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,12 @@ | ||
# User Guide | ||
|
||
Our guides describe the main DVC concepts and features comprehensively, | ||
explaining when and how to use them, as well as connections between them. These | ||
guides don't focus on specific scenarios, but have a general scope – like a user | ||
manual. Their topics range from more technical foundations, impacting more parts | ||
of DVC, to more advanced and specific things you can do. We also include a few | ||
guides related to contributing to | ||
[this open-source project](https://github.com/iterative/dvc). | ||
Our guides describe the major features and concepts of DVC comprehensively, | ||
explaining when and how to use them, as well as the relationships between them. We | ||
don't focus on specific scenarios in this section, but rather on a general | ||
scope. The topics here range from more foundational, impacting more parts of | ||
DVC, to more technical and advanced things you can do. We also include a few | ||
misc. guides, for example related to | ||
[contributing to DVC](/doc/user-guide/contributing/core) itself. | ||
|
||
Please choose from the navigation sidebar to the left, or click the `Next` | ||
button below ↘ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,133 @@ | ||
# Comparison with Related Technologies | ||
|
||
DVC combines a number of existing ideas into a single tool, with the goal of | ||
bringing best practices from software engineering into the data science field | ||
(refer to [What is DVC?](/doc/user-guide/what-is-dvc) for more details). | ||
|
||
## Git | ||
|
||
- DVC builds upon Git by introducing the concept of data files – large files | ||
that should not be stored in a Git repository, but still need to be tracked | ||
and versioned. It leverages Git's features to enable managing different | ||
versions of data itself, data pipelines, and experiments. | ||
|
||
- DVC is not fundamentally bound to Git, and can work without it (except for | ||
versioning-related features). This also applies to Git-LFS and Git-annex, | ||
below. | ||
|
||
## Git-LFS (Large File Storage) | ||
|
||
- Unlike Git-LFS, DVC does not require special servers. Any cloud storage | ||
like S3, Google Cloud Storage, or even an SSH server can be used as a | ||
[remote storage](/doc/command-reference/remote). No additional databases, | ||
servers, or infrastructure are required. | ||
|
||
- DVC does not add any hooks to the Git repo by default (although they are | ||
[available](/doc/command-reference/install)). | ||
|
||
- Git-LFS was not made with data science in mind, so it doesn't provide related | ||
features (e.g. [pipelines](/doc/command-reference/dag), | ||
[metrics](/doc/command-reference/metrics), etc.). | ||
|
||
- GitHub (the most common Git hosting service) has a limit of 2 GB per repository. | ||
|
||
## Git-annex | ||
|
||
- DVC can use reflinks\* or hardlinks (depending on the system) instead of | ||
symlinks to improve performance and the user experience. | ||
|
||
- Git-annex is a datafile-centric system whereas DVC focuses on providing a | ||
workflow for machine learning and reproducible experiments. When a DVC or | ||
Git-annex repository is cloned via `git clone`, data files won't be copied to | ||
the local machine, as file contents are stored in separate | ||
[remotes](/doc/command-reference/remote). With DVC, however, `.dvc` files, | ||
which provide the reproducible workflow, are always included in the Git | ||
repository. Hence, they can be executed locally with minimal effort. | ||
|
||
- DVC optimizes file hash calculation. | ||
|
||
|
||
> \* **copy-on-write links or "reflinks"** are a relatively new way to link | ||
> files in UNIX-style file systems. Unlike hardlinks or symlinks, they support | ||
> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This | ||
> means that editing a reflinked file is always safe, as none of the other links | ||
> to the file will reflect the changes. | ||
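The idea of linking instead of copying can be illustrated with a minimal Python sketch. It is not DVC's implementation: the function name `link_or_copy` and the file names are hypothetical, and because the Python standard library has no portable reflink call, the sketch only tries a hardlink and falls back to a plain copy.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Place `src` at `dst`, preferring a hardlink and falling back to a copy.

    Returns the strategy that was used: "hardlink" or "copy".
    (Hypothetical helper for illustration only.)
    """
    try:
        os.link(src, dst)  # hardlink: both names share one inode, no data is copied
        return "hardlink"
    except OSError:
        # For example, a cross-device link or a file system without hardlinks
        shutil.copy2(src, dst)
        return "copy"


def demo():
    """Link a small file inside a temporary directory and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        cached = os.path.join(tmp, "cached.bin")
        workspace = os.path.join(tmp, "workspace.bin")
        with open(cached, "wb") as f:
            f.write(b"example data")
        strategy = link_or_copy(cached, workspace)
        same_inode = os.stat(cached).st_ino == os.stat(workspace).st_ino
        with open(workspace, "rb") as f:
            content = f.read()
        return strategy, same_inode, content
```

When the hardlink succeeds, both paths refer to the same inode, so no extra disk space is used. Unlike a reflink, however, writing through one hardlinked name changes the data seen through the other.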
|
||
## Git workflows/methodologies such as Gitflow | ||
|
||
- DVC enables a new experimentation methodology that integrates easily with | ||
existing Git workflows. For example, a separate branch can be created for each | ||
experiment, with a subsequent merge of the branch if the experiment is | ||
successful. | ||
|
||
- DVC innovates by giving users the ability to easily navigate through past | ||
experiments without recomputing them each time. | ||
|
||
## Workflow management systems | ||
|
||
Pipelines and dependency graphs | ||
([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as _Airflow_, | ||
_Luigi_, etc. | ||
|
||
- DVC is focused on data science and modeling. As a result, DVC pipelines are | ||
lightweight and easy to create and modify. However, DVC lacks advanced | ||
pipeline execution features like execution monitoring, error handling, and | ||
recovery. | ||
|
||
|
||
- `dvc` is purely a command line tool without a graphical user interface (GUI) | ||
and doesn't run any daemons or servers. Nevertheless, DVC can generate images | ||
with pipeline and experiment workflow visualizations. | ||
|
||
- See also our sister project, [CML](https://cml.dev/), that helps fill some of | ||
these gaps. | ||
Comment on lines +79 to +81
Is this appropriate to add here? (Under Workflow management tools)
Probably not, but it's fine to keep it for now ;)
||
|
||
## Experiment management software | ||
|
||
- DVC uses Git as the underlying layer for data, pipeline, and experiment | ||
versioning, instead of a custom web application. | ||
|
||
- DVC doesn't need to run any services. There's no GUI as a result, but we | ||
expect some GUI services will be created on top of DVC. | ||
|
||
- DVC can generate images with [experiment](/doc/start/experiments) workflow | ||
visualizations. | ||
|
||
- DVC has a transparent design. Its | ||
[internal files and directories](/doc/user-guide/dvc-files-and-directories) | ||
have a human-readable format and can be easily reused by external tools. | ||
|
||
## Build automation tools | ||
|
||
[_Make_](https://www.gnu.org/software/make/) and others. | ||
|
||
- File tracking: | ||
|
||
- DVC tracks files based on their hash values (MD5) instead of using | ||
timestamps. This helps avoid running into heavy processes like model | ||
retraining when you check out a previous version of the project (Make would | ||
retrain the model). | ||
|
||
- DVC uses file timestamps and inodes\* for optimization. This allows DVC to | ||
avoid recomputing all dependency file hashes, which would be highly | ||
problematic when working with large files (multiple GB). | ||
|
||
- DVC utilizes a | ||
[directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) | ||
(DAG): | ||
|
||
- The DAG or dependency graph is defined implicitly by the connections between | ||
pipeline [stages](/doc/command-reference/run), based on their | ||
<abbr>dependencies</abbr> and <abbr>outputs</abbr>. | ||
|
||
- Each stage defines one node in the DAG. All DVC-files in a repository make | ||
up a [pipeline](/doc/command-reference/dag) (think a single Makefile). All | ||
stages (and corresponding processes) are implicitly combined through their | ||
inputs and outputs, simplifying conflict resolution during merges. | ||
|
||
- DVC stages can be written manually in an intuitive `dvc.yaml` file, or | ||
generated by the helper command `dvc run`, based on a terminal command, its | ||
inputs, and outputs. | ||
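As a rough illustration of how a dependency graph can be derived implicitly from stage dependencies and outputs, here is a minimal Python sketch. It is not DVC's code: the stage names, file paths, and the helpers `build_graph` and `execution_order` are assumptions made for this example, and the stage dictionary only loosely mirrors what a `dvc.yaml` file describes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each one lists the files it reads and the files it writes.
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}


def build_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def execution_order(stages):
    """Return stage names so that every stage comes after the stages it depends on."""
    return list(TopologicalSorter(build_graph(stages)).static_order())
```

No edge is declared explicitly here: `train` runs after `prepare` only because it reads a file that `prepare` writes. Calling `execution_order(stages)` yields the stages in an order that respects those implied edges.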
|
||
> \* **Inodes** are metadata file records to locate and store permissions to the | ||
> actual file contents. See **Linking files** in | ||
> [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for | ||
> technical details (Linux). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# What Is DVC? | ||
|
||
**Data Version Control** is a new type of data versioning, workflow, and | ||
experiment management software that builds upon [Git](https://git-scm.com/) | ||
(although it can work stand-alone). DVC reduces the gap between established | ||
engineering tool sets and data science needs, allowing users to take advantage | ||
of new [features](#core-features) while reusing existing skills and intuition. | ||
|
||
![](/img/reproducibility.png) _DVC codifies data and ML experiments_ | ||
|
||
Data science experiment sharing and collaboration can be done through a regular | ||
Git flow (commits, branching, pull requests, etc.), the same way it works for | ||
software engineers. | ||
|
||
## Core Features | ||
|
||
- DVC is a [free](https://github.com/iterative/dvc/blob/master/LICENSE), | ||
open-source [command line](/doc/command-reference) tool. | ||
|
||
- DVC works **on top of Git repositories** and has a similar command line | ||
interface and flow as Git. DVC can also work stand-alone, but without | ||
versioning capabilities. | ||
|
||
- **Data versioning** is enabled by replacing large files, dataset directories, | ||
ML models, etc. with small | ||
[metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with | ||
Git). These placeholders point to the original data, which is decoupled from | ||
source code management. | ||
|
||
- **Data storage**: On-premises or cloud storage can be used to store the | ||
project's data separate from its code base. This is how data scientists can | ||
transfer large datasets or share a GPU-trained model with others. | ||
|
||
- DVC makes data science projects **reproducible** by creating lightweight | ||
[pipelines](/doc/command-reference/dag) using implicit dependency graphs, and | ||
codifying the data and artifacts involved. | ||
|
||
- DVC is **platform agnostic**: It runs on all major operating systems (Linux, | ||
macOS, and Windows), and works independently of the programming languages | ||
(Python, R, Julia, shell scripts, etc.) or ML libraries (Keras, TensorFlow, | ||
PyTorch, SciPy, etc.) used in the <abbr>project</abbr>. | ||
|
||
- **Easy to use**: DVC is quick to [install](/doc/install) and doesn't require | ||
special infrastructure, nor does it depend on APIs or external services. It's | ||
a stand-alone CLI tool. | ||
|
||
> Git servers, as well as SSH and cloud storage providers, are supported, | ||
> however. |
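The metafile idea from the data versioning bullet above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `track` function, the `.meta.json` suffix, and the JSON layout are hypothetical and do not reproduce DVC's real metafile format or cache layout.

```python
import hashlib
import json
import os
import shutil
import tempfile


def track(data_path, cache_dir):
    """Store a copy of the data under its MD5 and write a small placeholder file.

    Hypothetical helper: returns the path of the placeholder (metafile).
    """
    with open(data_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()  # whole-file read, fine for a demo
    os.makedirs(cache_dir, exist_ok=True)
    shutil.copy2(data_path, os.path.join(cache_dir, digest))  # content-addressed copy
    meta_path = data_path + ".meta.json"
    with open(meta_path, "w") as f:
        json.dump({"md5": digest, "path": os.path.basename(data_path)}, f)
    return meta_path


def demo():
    """Track a 100 kB file and compare its size with the placeholder's size."""
    with tempfile.TemporaryDirectory() as tmp:
        data = os.path.join(tmp, "model.bin")
        with open(data, "wb") as f:
            f.write(b"x" * 100_000)
        meta = track(data, os.path.join(tmp, "cache"))
        return os.path.getsize(data), os.path.getsize(meta)
```

Only the small placeholder would be committed to Git in such a scheme. The data itself lives in the separate cache directory, which could then be synchronized with on-premises or cloud storage.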
This file was deleted.
This file was deleted.
minor: GitHub
Yeah I think both are OK. Technically Github is more correct writing-wise. We have a mix of both forms rn though, I'll standardize in another PR. Will use GitHub since you prefer that 🙂