guide: absorb What is DVC? into other existing docs, et al. #1581

Merged
merged 44 commits into from
Aug 10, 2020
44 commits
6690907
guide: What is DVC? -> into UG index
jorgeorpinel Jul 15, 2020
ce72b11
how-to: create section with questions from WID / Collab Issues
jorgeorpinel Jul 15, 2020
b466986
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Jul 20, 2020
3f0b0f0
how-to: make subsection of the user-guide, and
jorgeorpinel Jul 20, 2020
5a5901b
guide: hide Best Practices how to for now
jorgeorpinel Jul 20, 2020
a94d9f8
guide: rename how to and best practices title
jorgeorpinel Jul 20, 2020
92ae254
guide: What->Why in index to avoid redundancy with What section
jorgeorpinel Jul 20, 2020
d7762e6
guide: concepts->principles in What is DVC?
jorgeorpinel Jul 20, 2020
82554fe
guide: move troubleshooting inside How To
jorgeorpinel Jul 20, 2020
05a7a7d
guide: collapse What is DVC? into single doc, and
jorgeorpinel Jul 20, 2020
3e46fca
guide: fix redirect test for troubleshooting how to
jorgeorpinel Jul 20, 2020
93cc607
guide: revise What is DVC? up to Core Principles and
jorgeorpinel Jul 20, 2020
8a7c086
guide: finish revising What is DVC?
jorgeorpinel Jul 20, 2020
c30e966
guide: more updates to What is DVC? (per 1.x) and
jorgeorpinel Jul 20, 2020
733593c
guide: review intro and reorg Related Technologies
jorgeorpinel Jul 20, 2020
28226d6
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Jul 22, 2020
7f421dd
guide: add Questions header to best practices (hidden)
jorgeorpinel Jul 22, 2020
d390a6f
guide: hide GAPI PP
jorgeorpinel Jul 22, 2020
c97e93b
guide: revise Git-LFS section of related techs
jorgeorpinel Jul 23, 2020
30d38df
guide: revise all Git* related techs
jorgeorpinel Jul 23, 2020
a450d39
guide: revise remaining related techs
jorgeorpinel Jul 23, 2020
a93f24f
guide: remove img from basic concepts
jorgeorpinel Jul 23, 2020
1a6948e
guide: move troubleshooting back out of How To
jorgeorpinel Jul 23, 2020
3e7b18a
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 3, 2020
a556e6c
cases: move Why DVC? to Use Cases index and
jorgeorpinel Aug 4, 2020
eb5fbf9
guide: move Basic Principles from What is DVC? into Basic Concepts guide
jorgeorpinel Aug 4, 2020
c476f60
guide: remove "User Manual" term from index
jorgeorpinel Aug 4, 2020
3508a19
nav: remove ... from How To entry
jorgeorpinel Aug 4, 2020
2842e3a
tests: finis rolling back troubleshooting guide move
jorgeorpinel Aug 4, 2020
02c2b01
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 8, 2020
7411f53
cases: fix a link to related techs guide
jorgeorpinel Aug 8, 2020
9c75ae8
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 8, 2020
98ffea3
guide: propper structure in related techs
jorgeorpinel Aug 9, 2020
95521e6
guide: update remote storage core concept in what is dvc
jorgeorpinel Aug 9, 2020
fbd7e96
guide: improve Core Features of What is DVC?
jorgeorpinel Aug 9, 2020
86fbf43
guide: simplify data versioning core feature
jorgeorpinel Aug 9, 2020
de43edd
guide: update What is DVC? intro
jorgeorpinel Aug 9, 2020
bebd665
guide: simplify Core Features in What is DVC?
jorgeorpinel Aug 9, 2020
d8a71f8
guide: features before concepts (index)
jorgeorpinel Aug 9, 2020
722cbb3
guide: review term "features" in basic concepts
jorgeorpinel Aug 9, 2020
e6d5f78
guide: undo starting How To subsection
jorgeorpinel Aug 10, 2020
ca9203a
guide: undo changes to troubleshooting
jorgeorpinel Aug 10, 2020
eebb2e6
guide: a few more copy edits for What is DVC
jorgeorpinel Aug 10, 2020
c720552
guide: remove Basic Concepts page
jorgeorpinel Aug 10, 2020
14 changes: 3 additions & 11 deletions content/docs/sidebar.json
Original file line number Diff line number Diff line change
@@ -83,14 +83,9 @@
"source": "user-guide/index.md",
"children": [
{
"slug": "what-is-dvc",
"label": "What is DVC?",
"source": "what-is-dvc/index.md",
"children": [
"collaboration-issues",
"core-features",
"related-technologies"
]
"slug": "what-is-dvc",
"source": "what-is-dvc.md"
},
{
"label": "DVC Files and Directories",
@@ -134,13 +129,10 @@
"slug": "running-dvc-on-windows"
},
"troubleshooting",
"related-technologies",
{
"label": "Anonymized Usage Analytics",
"slug": "analytics"
},
{
"label": "Privacy Policy (Google APIs)",
"slug": "privacy"
}
]
},
22 changes: 15 additions & 7 deletions content/docs/use-cases/index.md
@@ -1,18 +1,26 @@
# Use Cases

We provide short articles on common ML workflow or data management scenarios
that DVC can help with or improve. These include the motivating context (usually
extracted from real-life cases); And the approaches to solving them can combine
several features of DVC. Use cases are not written to be run end-to-end. For
more general, hands-on experience with DVC, we recommend following the
[Get Started](/doc/tutorials/get-started), and/or [Tutorials](/doc/tutorials)
first.
that DVC can help with or improve. These include a motivation (usually from
real-life cases), and approaches which combine several features of DVC. Use
cases are not written to be run end-to-end like tutorials. For more general,
hands-on experience with DVC, please see our
[Get Started](/doc/tutorials/get-started) instead.

> We keep reviewing our docs and will include interesting scenarios that surface
> in the community. Please, [contact us](/support) if you need help or have
> suggestions!

## Basic uses
## Why DVC?

Even with all the success we've seen today in machine learning (ML), especially
with deep learning and its applications in business, the data science community
still lacks good practices for organizing their projects and collaborating
effectively. This is a critical challenge: while ML algorithms and methods are
no longer tribal knowledge, they are still difficult to implement, reuse, and
manage.

## Basic uses of DVC

If you store and process data files or datasets to produce other data or machine
learning models, and you want to
14 changes: 7 additions & 7 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -14,13 +14,13 @@ This allows easily saving and sharing data alongside code.

![](/img/model-versioning-diagram.png)

In this basic scenario, DVC is a better replacement for `git-lfs` (see
[Related Technologies](/doc/understanding-dvc/related-technologies)) and for
ad-hoc scripts on top of Amazon S3 (or any other cloud) used to manage ML
<abbr>data artifacts</abbr> like raw data, models, etc. Unlike `git-lfs`, DVC
doesn't require installing a dedicated server; It can be used on-premises (e.g.
SSH, NAS) or with any major cloud storage provider (Amazon S3, Microsoft Azure
Blob Storage, Google Drive, Google Cloud Storage, etc).
In this basic scenario, DVC is a better replacement for Git-LFS (see
[Related Technologies](/doc/user-guide/related-technologies)) and for ad-hoc
scripts on top of Amazon S3 (or any other cloud) used to manage ML <abbr>data
artifacts</abbr> like raw data, models, etc. Unlike Git-LFS, DVC doesn't require
installing a dedicated server; it can be used on-premises (e.g. SSH, NAS) or
with any major cloud storage provider (Amazon S3, Microsoft Azure Blob Storage,
Google Drive, Google Cloud Storage, etc).

Let's say you already have a Git repository and put a bunch of images in the
`images/` directory, and build a `model.pkl` ML model file using them.
14 changes: 7 additions & 7 deletions content/docs/user-guide/index.md
@@ -1,12 +1,12 @@
# User Guide

Our guides describe the main DVC concepts and features comprehensively,
explaining when and how to use them, as well as connections between them. These
guides don't focus on specific scenarios, but have a general scope – like a user
manual. Their topics range from more technical foundations, impacting more parts
of DVC, to more advanced and specific things you can do. We also include a few
guides related to contributing to
[this open-source project](https://github.com/iterative/dvc).
Our guides describe the major features and concepts of DVC comprehensively,
explaining when and how to use them, as well as the relationships between them. We
don't focus on specific scenarios in this section, but rather on a general
scope. The topics here range from more foundational, impacting more parts of
DVC, to more technical and advanced things you can do. We also include a few
misc. guides, for example related to
[contributing to DVC](/doc/user-guide/contributing/core) itself.

Please choose from the navigation sidebar to the left, or click the `Next`
button below ↘
133 changes: 133 additions & 0 deletions content/docs/user-guide/related-technologies.md
@@ -0,0 +1,133 @@
# Comparison with Related Technologies

DVC combines a number of existing ideas into a single tool, with the goal of
bringing best practices from software engineering into the data science field
(refer to [What is DVC?](/doc/user-guide/what-is-dvc) for more details).

## Git

- DVC builds upon Git by introducing the concept of data files – large files
that should not be stored in a Git repository, but still need to be tracked
and versioned. It leverages Git's features to enable managing different
versions of data itself, data pipelines, and experiments.

- DVC is not fundamentally bound to Git, and can work without it (except
versioning-related features). This also applies to Git-LFS and Git-annex,
below.

## Git-LFS (Large File Storage)

- DVC does not require special servers like Git-LFS demands. Any cloud storage
like S3, Google Cloud Storage, or even an SSH server can be used as a
[remote storage](/doc/command-reference/remote). No additional databases,
servers, or infrastructure are required.

- DVC does not add any hooks to the Git repo by default (although they are
[available](/doc/command-reference/install)).

- Git-LFS was not made with data science in mind, so it doesn't provide related
features (e.g. [pipelines](/doc/command-reference/dag),
[metrics](/doc/command-reference/metrics), etc.).

- Github (most common Git hosting service) has a limit of 2 GB per repository.
Member

minor: GitHub

Contributor Author

Yeah I think both are OK. Technically Github is more correct writing-wise. We have a mix of both forms rn though, I'll standardize in another PR. Will use GitHub since you prefer that 🙂


## Git-annex

- DVC can use reflinks\* or hardlinks (depending on the system) instead of
symlinks to improve performance and the user experience.

- Git-annex is a datafile-centric system whereas DVC focuses on providing a
workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via `git clone`, data files won't be copied to
the local machine, as file contents are stored in separate
[remotes](/doc/command-reference/remote). With DVC however, `.dvc` files,
which provide the reproducible workflow, are always included in the Git
repository. Hence, they can be executed locally with minimal effort.

- DVC optimizes file hash calculation.

> \* **copy-on-write links or "reflinks"** are a relatively new way to link
> files in UNIX-style file systems. Unlike hardlinks or symlinks, they support
> transparent [copy on write](https://en.wikipedia.org/wiki/Copy-on-write). This
> means that editing a reflinked file is always safe, as the other links to the
> file will not reflect the changes.
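
To make the link semantics concrete, here is a minimal Python sketch, assuming a
file system that supports hardlinks. The Python standard library has no portable
reflink call, so a plain copy stands in for how a reflinked file behaves once
one side is written to; the file names are hypothetical.

```python
import os
import shutil
import tempfile


def demo_link_semantics() -> dict:
    """Edit the original file, then report what a hardlink and a copy contain."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "data.txt")
        hardlink = os.path.join(tmp, "data.hardlink.txt")
        copy = os.path.join(tmp, "data.copy.txt")

        with open(original, "w") as f:
            f.write("v1")

        os.link(original, hardlink)  # a second name for the same file contents
        shutil.copy(original, copy)  # independent contents (stand-in for a reflink after a write)

        # Overwrite the contents in place through the original name.
        with open(original, "w") as f:
            f.write("v2")

        result = {}
        for name, path in (("hardlink", hardlink), ("copy", copy)):
            with open(path) as f:
                result[name] = f.read()
        return result


print(demo_link_semantics())
```

The hardlink sees the edit because both names share the same contents, while the
independent copy keeps the old data. A reflink offers that isolation without
duplicating the data up front.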

## Git workflows/methodologies such as Gitflow

- DVC enables a new experimentation methodology that integrates easily with
existing Git workflows. For example, a separate branch can be created for each
experiment, with a subsequent merge of the branch if the experiment is
successful.

- DVC innovates by giving users the ability to easily navigate through past
experiments without recomputing them each time.

## Workflow management systems

Pipelines and dependency graphs
([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) such as _Airflow_,
_Luigi_, etc.

- DVC is focused on data science and modeling. As a result, DVC pipelines are
lightweight and easy to create and modify. However, DVC lacks advanced
pipeline execution features like execution monitoring, error handling, and
recovery.

- `dvc` is purely a command line tool without a graphical user interface (GUI)
and doesn't run any daemons or servers. Nevertheless, DVC can generate images
with pipeline and experiment workflow visualizations.

- See also our sister project, [CML](https://cml.dev/), that helps fill some of
these gaps.
Comment on lines +79 to +81
Contributor Author

Is this appropriate to add here? (Under Workflow management tools)

Member

Probably not, but it's fine to keep it for now ;)


## Experiment management software

- DVC uses Git as the underlying layer for data, pipelines, and experiment
versioning, instead of a custom web application.

- DVC doesn't need to run any services. There's no GUI as a result, but we
expect some GUI services will be created on top of DVC.

- DVC can generate images with [experiment](/doc/start/experiments) workflow
visualizations.

- DVC has a transparent design. Its
[internal files and directories](/doc/user-guide/dvc-files-and-directories)
have a human-readable format and can be easily reused by external tools.

## Build automation tools

[_Make_](https://www.gnu.org/software/make/) and others.

- File tracking:

- DVC tracks files based on their hash values (MD5) instead of using
timestamps. This helps avoid running into heavy processes like model
retraining when you check out a previous version of the project (Make would
retrain the model).

- DVC uses file timestamps and inodes\* for optimization. This allows DVC to
avoid recomputing all dependency file hashes, which would be highly
problematic when working with large files (multiple GB).

- DVC utilizes a
[directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)
(DAG):

- The DAG or dependency graph is defined implicitly by the connections between
pipeline [stages](/doc/command-reference/run), based on their
<abbr>dependencies</abbr> and <abbr>outputs</abbr>.

- Each stage defines one node in the DAG. All DVC-files in a repository make
up a [pipeline](/doc/command-reference/dag) (think a single Makefile). All
stages (and corresponding processes) are implicitly combined through their
inputs and outputs, simplifying conflict resolution during merges.

- DVC stages can be written manually in an intuitive `dvc.yaml` file, or
generated by the helper command `dvc run`, based on a terminal command, its
inputs, and outputs.
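
As a rough illustration of how a dependency graph can be implied by stage inputs
and outputs, the Python sketch below links hypothetical stages whenever one
stage's output is another stage's dependency. The stage names, file names, and
the `build_graph` helper are assumptions made for this example, not DVC's
internal implementation. It needs Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each lists the files it reads (deps) and writes (outs).
stages = {
    "prepare": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "clean.csv"], "outs": ["metrics.json"]},
}


def build_graph(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


graph = build_graph(stages)

# A topological order runs every stage after the stages it depends on.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

No edges are declared anywhere: the graph follows entirely from which files each
stage reads and writes, which is the sense in which the DAG is implicit.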

> \* **Inodes** are metadata file records to locate and store permissions to the
> actual file contents. See **Linking files** in
> [this doc](http://www.tldp.org/LDP/intro-linux/html/sect_03_03.html) for
> technical details (Linux).
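
The difference between timestamp-based and hash-based tracking can be sketched in
a few lines of Python. This is an illustrative comparison only, assuming a single
hypothetical file `train.csv`; the helper names are made up for the example and
do not reflect how DVC or Make are implemented internally.

```python
import hashlib
import os
import tempfile


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_by_mtime(path, recorded_mtime_ns):
    """Timestamp check: a newer modification time counts as a change."""
    return os.stat(path).st_mtime_ns > recorded_mtime_ns


def changed_by_hash(path, recorded_md5):
    """Content check: only different bytes count as a change."""
    return md5_of(path) != recorded_md5


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")
    with open(path, "w") as f:
        f.write("x,y\n1,2\n")

    recorded_mtime_ns = os.stat(path).st_mtime_ns
    recorded_md5 = md5_of(path)

    # Simulate a checkout that leaves identical contents with a newer timestamp.
    newer = recorded_mtime_ns + 10**9
    os.utime(path, ns=(newer, newer))

    mtime_says_changed = changed_by_mtime(path, recorded_mtime_ns)
    hash_says_changed = changed_by_hash(path, recorded_md5)

print(mtime_says_changed, hash_says_changed)
```

Here the timestamp check reports a change while the hash check does not, so only
the timestamp-based approach would trigger a rebuild of anything depending on
this file.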
48 changes: 48 additions & 0 deletions content/docs/user-guide/what-is-dvc.md
@@ -0,0 +1,48 @@
# What Is DVC?

**Data Version Control** is a new type of data versioning, workflow, and
experiment management software that builds upon [Git](https://git-scm.com/)
(although it can work stand-alone). DVC reduces the gap between established
engineering tool sets and data science needs, allowing users to take advantage
of new [features](#core-features) while reusing existing skills and intuition.

![](/img/reproducibility.png) _DVC codifies data and ML experiments_

Data science experiment sharing and collaboration can be done through a regular
Git flow (commits, branching, pull requests, etc.), the same way it works for
software engineers.

## Core Features

- DVC is a [free](https://github.com/iterative/dvc/blob/master/LICENSE),
open-source [command line](/doc/command-reference) tool.

- DVC works **on top of Git repositories** and has a similar command line
interface and flow as Git. DVC can also work stand-alone, but without
versioning capabilities.

- **Data versioning** is enabled by replacing large files, dataset directories,
ML models, etc. with small
[metafiles](/doc/user-guide/dvc-files-and-directories) (easy to handle with
Git). These placeholders point to the original data, which is decoupled from
source code management.

- **Data storage**: On-premises or cloud storage can be used to store the
project's data separate from its code base. This is how data scientists can
transfer large datasets or share a GPU-trained model with others.

- DVC makes data science projects **reproducible** by creating lightweight
[pipelines](/doc/command-reference/dag) using implicit dependency graphs, and
codifying the data and artifacts involved.

- DVC is **platform agnostic**: It runs on all major operating systems (Linux,
macOS, and Windows), and works independently of the programming languages
(Python, R, Julia, shell scripts, etc.) or ML libraries (Keras, TensorFlow,
PyTorch, SciPy, etc.) used in the <abbr>project</abbr>.

- **Easy to use**: DVC is quick to [install](/doc/install) and doesn't require
special infrastructure, nor does it depend on APIs or external services. It's
a stand-alone CLI tool.

> Git servers, as well as SSH and cloud storage providers are supported,
> however.
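
The data versioning idea above (small placeholders that point to data kept
elsewhere) can be sketched in Python as follows. This is a simplified
illustration under stated assumptions: the `.pointer.json` name, the JSON layout,
and the `track`/`restore` helpers are hypothetical and are not DVC's actual
metafile format or commands.

```python
import hashlib
import json
import os
import shutil
import tempfile


def track(data_path, cache_dir):
    """Move a file into a content-addressed cache and leave a small pointer file."""
    with open(data_path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    os.makedirs(cache_dir, exist_ok=True)
    shutil.move(data_path, os.path.join(cache_dir, md5))
    pointer_path = data_path + ".pointer.json"  # hypothetical name and format
    with open(pointer_path, "w") as f:
        json.dump({"md5": md5, "path": os.path.basename(data_path)}, f)
    return pointer_path


def restore(pointer_path, cache_dir):
    """Recreate the data file next to its pointer from the cache."""
    with open(pointer_path) as f:
        meta = json.load(f)
    target = os.path.join(os.path.dirname(pointer_path), meta["path"])
    shutil.copy(os.path.join(cache_dir, meta["md5"]), target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    data = os.path.join(tmp, "model.pkl")
    with open(data, "wb") as f:
        f.write(b"pretend this is a large model")

    cache = os.path.join(tmp, "cache")
    pointer = track(data, cache)
    pointer_size = os.path.getsize(pointer)

    restored = restore(pointer, cache)
    with open(restored, "rb") as f:
        restored_bytes = f.read()

print(pointer_size, restored_bytes)
```

Only the small pointer file would need to be committed to Git; the cache
directory plays the role of storage that lives outside source control.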
53 changes: 0 additions & 53 deletions content/docs/user-guide/what-is-dvc/collaboration-issues.md

This file was deleted.

20 changes: 0 additions & 20 deletions content/docs/user-guide/what-is-dvc/core-features.md

This file was deleted.
