DVC file -> .dvc file (2nd chunk) reopen (#1408)
* term: DVC-file -> .dvc file from Utkarsh work (2nd chunk)
per #1372 (review)

* 2nd chunk DVC-file -> .dvc file

* Added links to first occurrences of .dvc files in /basic-concepts/

* Formatting changes

* Update content/docs/user-guide/basic-concepts/dvc-project.md

* Update content/docs/user-guide/basic-concepts/external-dependency.md

* Review changes - I

* Update content/docs/command-reference/pull.md

Co-authored-by: Jorge Orpinel <[email protected]>

* Review changes - II

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/remove.md

* update content/docs/command-reference/pull.md push.md remove.md

* update content/docs/command-reference/fetch.md status.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/push.md

* Update content/docs/command-reference/fetch.md

* Update content/docs/command-reference/pull.md

* Update content/docs/command-reference/push.md

* Update content/docs/command-reference/remove.md

* Update content/docs/command-reference/status.md

* Update content/docs/command-reference/update.md

* formatting content/docs/command-reference/pull.md

* update content/docs/command-reference/status push fetch

* Update content/docs/command-reference/status.md

* update "content/docs/command-reference/fetch.md"

* update content/docs/command-reference/import.md

* update content/docs/command-reference/import-url.md

* update content/docs/command-reference/list.md

* Update content/docs/command-reference/status.md

* update content/docs/command-reference/status.md

* Update content/docs/command-reference/status.md

* Update content/docs/command-reference/status.md

* Update content/docs/command-reference/status.md

Co-authored-by: Jorge Orpinel <[email protected]>
Co-authored-by: Jorge Orpinel <[email protected]>
3 people authored Jun 12, 2020
1 parent 64182e2 commit 55ba0a1
Showing 14 changed files with 222 additions and 186 deletions.
60 changes: 32 additions & 28 deletions content/docs/command-reference/fetch.md
@@ -11,8 +11,9 @@ usage: dvc fetch [-h] [-q | -v] [-j <number>]
[targets [targets ...]]
positional arguments:
targets Limit command scope to these DVC-files. Using -R,
directories to search DVC-files in can also be given.
targets Limit command scope to these stages or .dvc files.
Using -R, directories can also be given, to search
.dvc files in.
```

## Description
@@ -22,7 +23,8 @@ of the project, but without placing them in the <abbr>workspace</abbr>. This
makes the data files available for linking (or copying) into the workspace.
(Refer to [dvc config cache.type](/doc/command-reference/config#cache).) Along
with `dvc checkout`, it's performed automatically by `dvc pull` when the target
[DVC-files](/doc/user-guide/dvc-file-format) are not already in the cache:
[`dvc.yaml`](/doc/user-guide/dvc-file-format) or
[`.dvc`](/doc/user-guide/dvc-file-format) files are not already in the cache:

```
Controlled files Commands
@@ -45,18 +47,17 @@ Fetching could be useful when first checking out a <abbr>DVC project</abbr>,
since files tracked by DVC should already exist in remote storage, but won't be
in the project's <abbr>cache</abbr>. (Refer to `dvc remote` for more information
on DVC remotes.) These necessary data or model files are listed as
<abbr>dependencies</abbr> or <abbr>outputs</abbr> in a DVC-file (target
[stage](/doc/command-reference/run)) so they are required to
[reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline). (See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on
dependencies and outputs.)

`dvc fetch` ensures that the files needed for a DVC-file to be
<abbr>dependencies</abbr> or <abbr>outputs</abbr> in a target
[stage](/doc/command-reference/run) (in `dvc.yaml`) or `.dvc` file, so they are
required to [reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline).

`dvc fetch` ensures that the files needed for a
[stage](/doc/command-reference/run) or `.dvc` file to be
[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in
cache. If no `targets` are specified, the set of data files to fetch is
determined by analyzing all DVC-files in the current branch, unless
`--all-branches` or `--all-tags` is specified.
determined by analyzing all `dvc.yaml` and `.dvc` files in the current branch,
unless `--all-branches` or `--all-tags` is specified.
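
As a rough illustration of the "no `targets`" case, the sketch below walks a project tree and collects every `dvc.yaml` and `.dvc` file it finds. It is not DVC's implementation: the function name `find_dvc_metafiles` is hypothetical, and skipping the `.dvc/` and `.git/` directories is an assumption made for the example.

```python
import os
from pathlib import Path


def find_dvc_metafiles(root):
    """Return sorted paths of dvc.yaml and *.dvc files found under root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: internal directories never hold metafiles, so prune them.
        dirnames[:] = [d for d in dirnames if d not in {".dvc", ".git"}]
        for name in filenames:
            if name == "dvc.yaml" or name.endswith(".dvc"):
                found.append(Path(dirpath, name))
    return sorted(found)
```

Calling `find_dvc_metafiles(".")` from a project root would list the files whose recorded hashes determine what needs to be fetched.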

The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used.
@@ -65,9 +66,10 @@ option is used.
perform data synchronization among local and remote storage. The specific way in
which the set of files to push/fetch/pull is determined begins with calculating
file hashes when these are [added](/doc/command-reference/add) with DVC. File
hashes are stored in the corresponding DVC-files (typically versioned with Git).
Only the hashes specified in DVC-files currently in the workspace are considered
by `dvc fetch` (unless the `-a` or `-T` options are used).
hash values are stored in the corresponding `dvc.yaml` or `.dvc` files
(typically versioned with Git). Only the hashes specified in `dvc.yaml` or `.dvc`
files currently in the workspace are considered by `dvc fetch` (unless the `-a`
or `-T` options are used).
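
The hash-based selection described above can be sketched in a few lines. This is only an illustration under stated assumptions: hashes are taken to be MD5 hex digests (as the `md5` fields elsewhere in these docs suggest), the two-character directory split in `cache_relpath` is an assumed cache layout, and all three function names are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Hex MD5 digest of some file content."""
    return hashlib.md5(data).hexdigest()


def cache_relpath(hash_value: str) -> str:
    # Assumed layout: <first 2 hex chars>/<remaining chars> inside the cache dir.
    return f"{hash_value[:2]}/{hash_value[2:]}"


def hashes_to_fetch(referenced, cached):
    """Hashes named in workspace metafiles that are not yet in the local cache."""
    return set(referenced) - set(cached)
```

With these helpers, a hash that is referenced in the workspace but missing from the cache is exactly what a fetch would need to download.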

## Options

@@ -76,14 +78,14 @@ by `dvc fetch` (unless the `-a` or `-T` options are used).
`dvc remote list`).

- `-d`, `--with-deps` - determines files to download by tracking dependencies to
the target DVC-files (stages). If no `targets` are provided, this option is
ignored. By traversing all stage dependencies, DVC searches backward from the
target stages in the corresponding pipelines. This means DVC will not fetch
files referenced in later stages than the `targets`.
the `targets`. If none are provided, this option is ignored. By traversing all
stage dependencies, DVC searches backward from the target stages in the
corresponding pipelines. This means DVC will not fetch files referenced in
later stages than the `targets`.

- `-R`, `--recursive` - determines the files to fetch by searching each target
directory and its subdirectories for DVC-files to inspect. If there are no
directories among the `targets`, this option is ignored.
directory and its subdirectories for `dvc.yaml` and `.dvc` files to inspect.
If there are no directories among the `targets`, this option is ignored.

- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
handle the downloading of files from the remote. The default value is
@@ -93,7 +95,7 @@ by `dvc fetch` (unless the `-a` or `-T` options are used).

- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the
current workspace. This means DVC may download files needed to reproduce
different versions of a DVC-file
different versions of a `.dvc` file
([experiments](/doc/tutorials/get-started/experiments)), not just the ones
currently in the workspace. Note that this can be combined with `-T` below,
for example using the `-aT` flag.
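
The backward search performed by `--with-deps` can be pictured as a graph traversal. The sketch below is a simplified model, not DVC's code: the `PIPELINE` mapping, its stage names, and the function `stages_to_fetch` are hypothetical, with each stage mapped to the stages it depends on.

```python
def stages_to_fetch(targets, upstream):
    """Collect the targets plus every stage reachable by walking dependencies backward."""
    seen = set()
    stack = list(targets)
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue
        seen.add(stage)
        # Follow edges toward earlier stages only; later stages are never visited.
        stack.extend(upstream.get(stage, ()))
    return seen


# Hypothetical pipeline: each stage lists the stages it depends on.
PIPELINE = {
    "prepare": [],
    "featurize": ["prepare"],
    "train": ["featurize"],
    "evaluate": ["train"],
}
```

Here `stages_to_fetch(["featurize"], PIPELINE)` covers `featurize` and `prepare`, while `train` and `evaluate`, which come later in the pipeline, are left out.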
@@ -194,9 +196,11 @@ Note that the `.dvc/cache` directory was created and populated.
> for more info.
Used without arguments (as above), `dvc fetch` downloads all assets needed by
all DVC-files in the current branch, including for directories. The hash values
`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
correspond to the `model.pkl` file and `data/features/` directory, respectively.
all [`dvc.yaml`](/doc/user-guide/dvc-file-format) and
[`.dvc`](/doc/user-guide/dvc-file-format) files in the current branch, including
for directories. The hash values `3863d0e317dee0a55c4e59d2ec0eef33` and
`42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and
`data/features/` directory, respectively.

Let's now link files from the cache to the workspace with:

@@ -210,7 +214,7 @@ $ dvc checkout
> follow this example if you tried the previous one (**Default behavior**).
`dvc fetch` only downloads the data files of a specific stage when the
corresponding DVC-file (command target) is specified:
corresponding `.dvc` file (command target) is specified:

```dvc
$ dvc fetch prepare.dvc
@@ -276,7 +280,7 @@ $ tree .dvc/cache
```

Fetching using `--with-deps` starts with the target
[DVC-file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches
[`.dvc` file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches
backwards through its pipeline for data to download into the project's cache.
All the data for the second and third stages ("featurize" and "train") has now
been downloaded to the cache. We could now use `dvc checkout` to get the data
35 changes: 19 additions & 16 deletions content/docs/command-reference/import-url.md
@@ -2,7 +2,8 @@

Download a file or directory from a supported URL (for example `s3://`,
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track
changes in the remote data source. Creates a DVC-file.
changes in the remote data source. Creates a
[`.dvc` file](/doc/user-guide/dvc-file-format).

> See `dvc import` to download and track data/model files or directories from
> other <abbr>DVC repositories</abbr> (e.g. hosted on Github).
@@ -41,11 +42,11 @@ while `out` can be used to specify the directory and/or file name desired for
the downloaded data. If an existing directory is specified, the file or
directory will be placed inside.

[DVC-files](/doc/user-guide/dvc-file-format) support references to data in an
[`.dvc` files](/doc/user-guide/dvc-file-format) support references to data in an
external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a
DVC-file, the `deps` field stores the remote URL, and the `outs` field contains
the corresponding local path in the <abbr>workspace</abbr>. It records enough
[External Dependencies](/doc/user-guide/external-dependencies). In such a `.dvc`
file, the `deps` field stores the remote URL, and the `outs` field contains the
corresponding local path in the <abbr>workspace</abbr>. It records enough
metadata about the imported data to let DVC efficiently determine whether
the local copy is out of date.

@@ -102,9 +103,11 @@ $ dvc run -d https://example.com/path/to/data.csv \
wget https://example.com/path/to/data.csv -O data.csv
```

Both methods generate a [DVC-files](/doc/user-guide/dvc-file-format) with an
external dependency, but the one created by `dvc import-url` preserves the
connection to the data source. We call this an _import stage_.
`dvc import-url` generates an import stage
[`.dvc` file](/doc/user-guide/dvc-file-format) and `dvc run` a regular stage (in
[`dvc.yaml`](/doc/user-guide/dvc-file-format)). Both have an external
dependency, but the one created by `dvc import-url` preserves the connection to
the data source. We call this an _import stage_.

Note that import stages are considered always
[frozen](/doc/command-reference/freeze), meaning that if you run `dvc repro`,
@@ -114,9 +117,9 @@ from the external data source.
## Options

- `-f <filename>`, `--file <filename>` - specify a path and/or file name for the
DVC-file created by this command (e.g. `-f stages/stage.dvc`). This overrides
the default file name: `<file>.dvc`, where `<file>` is the desired file name
of the imported data (`out`).
`.dvc` file created by this command (e.g. `-f stages/stage.dvc`). This
overrides the default file name: `<file>.dvc`, where `<file>` is the desired
file name of the imported data (`out`).

- `-h`, `--help` - prints the usage/help message, and exit.
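
The default naming rule mentioned for `-f` (`<file>.dvc`, where `<file>` is the file name of `out`) can be written as a one-line helper. This is a sketch of the rule as stated in the text only; the function name `default_dvc_file` is hypothetical and it deliberately returns just the file name, not a full path.

```python
from pathlib import PurePosixPath


def default_dvc_file(out: str) -> str:
    """Default metafile name for an imported output: its file name plus '.dvc'."""
    return PurePosixPath(out).name + ".dvc"
```

For an `out` of `data/data.xml`, this gives `data.xml.dvc`, which matches the file name used in the examples in this document.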

@@ -168,7 +171,7 @@ To track the changes with git, run:
git add data.xml.dvc data/.gitignore
```

Let's take a look at the resulting stage file (DVC-file) `data.xml.dvc`:
Let's take a look at the resulting stage file (`.dvc` file) `data.xml.dvc`:

```yaml
md5: 61e80c38c1ce04ed2e11e331258e6d0d
@@ -184,7 +187,7 @@ outs:
persist: false
```
The `etag` field in the DVC-file contains the
The `etag` field in the `.dvc` file contains the
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag) recorded from the HTTP request.
If the remote file changes, its ETag will be different. This metadata allows DVC
to determine whether it's necessary to download it again.
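
The ETag comparison can be sketched with the Python standard library. This is a minimal illustration of the idea rather than DVC's logic: `fetch_etag` and `needs_download` are hypothetical names, and treating a missing ETag as "download again" is an assumption chosen to err on the safe side.

```python
import urllib.request


def fetch_etag(url):
    """Return the ETag header from a HEAD request to url, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("ETag")


def needs_download(recorded_etag, current_etag):
    """Decide whether the local copy should be refreshed."""
    # Assumption: without an ETag on either side there is no proof the copy is current.
    if recorded_etag is None or current_etag is None:
        return True
    return recorded_etag != current_etag
```

In use, `recorded_etag` would come from the `etag` field of the `.dvc` file and `current_etag` from `fetch_etag(url)` for the same source URL.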
@@ -242,7 +245,7 @@ outs:
persist: false
```

The DVC-file is nearly the same as in the previous example. The difference is
The `.dvc` file is nearly the same as in the previous example. The difference is
that the dependency (`deps`) now references the local file in the data store
directory we created previously. (Its `path` has the URL for the data store.)
And instead of an `etag` we have an `md5` hash value. We did this so it's easy to
@@ -310,8 +313,8 @@ Data and pipelines are up to date.

In the data store directory, edit `data.xml`. It doesn't matter what you change,
as long as it remains a valid XML file, because any change will result in a
different dependency file hash (`md5`) in the import stage DVC-file. Once we do
so, we can run `dvc update` to make sure the import stage is up to date:
different dependency file hash (`md5`) in the import stage `.dvc` file. Once we
do so, we can run `dvc update` to make sure the import stage is up to date:

```dvc
$ dvc update data.xml.dvc
36 changes: 19 additions & 17 deletions content/docs/command-reference/import.md
@@ -2,7 +2,7 @@

Download a file or directory tracked by DVC or by Git into the
<abbr>workspace</abbr>. It also creates a
[DVC-file](/doc/user-guide/dvc-file-format) with information about the data
[`.dvc` file](/doc/user-guide/dvc-file-format) with information about the data
source, which can later be used to [update](/doc/command-reference/update) the
import.

@@ -44,7 +44,8 @@ The `path` argument is used to specify the location of the target to be
downloaded within the source repository at `url`. `path` can specify any file or
directory in the source repo, including those tracked by DVC, or by Git. Note
that DVC-tracked targets should be found in a
[DVC-file](/doc/user-guide/dvc-file-format) of the project.
[`dvc.yaml`](/doc/user-guide/dvc-file-format) or
[`.dvc`](/doc/user-guide/dvc-file-format) file of the project.

⚠️ The project should have a default
[DVC remote](/doc/command-reference/remote), containing the actual data for this
@@ -55,15 +56,16 @@ command to work.
After running this command successfully, the imported data is placed in the
current working directory (unless `-o` is used) with its original file name e.g.
`data.txt`. An _import stage_ (DVC-file) is also created in the same location,
extending the name of the imported data e.g. `data.txt.dvc` – similar to having
used `dvc run` to generate the data as a stage <abbr>output</abbr>.
`data.txt`. An _import stage_ (`.dvc` file) is also created in the same
location, extending the name of the imported data e.g. `data.txt.dvc` – similar
to having used `dvc run` to generate the data as a stage <abbr>output</abbr>.

DVC-files support references to data in an external DVC repository (hosted on a
Git server). In such a DVC-file, the `deps` field specifies the remote `url` and
data `path`, and the `outs` field contains the corresponding local path in the
<abbr>workspace</abbr>. It records enough metadata about the imported data to
enable DVC efficiently determining whether the local copy is out of date.
`.dvc` files support references to data in an external DVC repository (hosted on
a Git server). In such a `.dvc` file, the `deps` field specifies the remote
`url` and data `path`, and the `outs` field contains the corresponding local
path in the <abbr>workspace</abbr>. It records enough metadata about the
imported data to let DVC efficiently determine whether the local copy is
out of date.

To actually
[track the data](https://dvc.org/doc/tutorials/get-started/data-versioning),
@@ -113,8 +115,8 @@ Importing 'data/data.xml ([email protected]:iterative/example-get-started)'

In contrast with `dvc get`, this command doesn't just download the data file,
but it also creates an import stage
([DVC-file](/doc/user-guide/dvc-file-format)) with a link to the data source (as
explained in the description above). (This import stage can later be used to
([`.dvc` file](/doc/user-guide/dvc-file-format)) with a link to the data source
(as explained in the description above). (This import stage can later be used to
[update](/doc/command-reference/update) the import.) Check `data.xml.dvc`:

```yaml
@@ -153,7 +155,7 @@ Importing
```

When using this option, the import stage
([DVC-file](/doc/user-guide/dvc-file-format)) will also have a `rev` subfield
([`.dvc` file](/doc/user-guide/dvc-file-format)) will also have a `rev` subfield
under `repo`:

```yaml
@@ -167,7 +169,7 @@ deps:

If `rev` is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update `rev_lock` in the DVC-file), simply use `dvc update <stage>.dvc`. If
update `rev_lock` in the `.dvc` file), simply use `dvc update <stage>.dvc`. If
`rev` is a specific commit hash (does not change), `dvc update` without options
will not have an effect on the import stage. You may force-update it to a
different commit with `dvc update --rev`:
@@ -185,7 +187,7 @@ If you take a look at our
[dataset registry](https://github.com/iterative/dataset-registry)
<abbr>project</abbr>, you'll see that it's organized into different directories
such as `tutorial/ver` and `use-cases/`, and these contain
[DVC-files](/doc/user-guide/dvc-file-format) that track different datasets.
[`.dvc` files](/doc/user-guide/dvc-file-format) that track different datasets.
Given this simple structure, its data files can be easily shared among several
other projects using `dvc get` and `dvc import`. For example:

@@ -206,7 +208,7 @@ $ dvc import [email protected]:iterative/dataset-registry.git \
`dvc import` provides a better way to incorporate data files tracked in external
<abbr>DVC repositories</abbr> because it saves the connection between the
current project and the source repo. This means that enough information is
recorded in an import stage (DVC-file) in order to
recorded in an import stage (`.dvc` file) in order to
[reproduce](/doc/command-reference/repro) downloading of this same data version
in the future, where and when needed. This is achieved with the `repo` field,
for example (matching the import command above):
@@ -245,7 +247,7 @@ Importing ...
> Note that Git-tracked files can be imported from DVC repos as well.

The file is imported, and along with it, an import stage
([DVC-file](/doc/user-guide/dvc-file-format)) file is created. Check
([`.dvc` file](/doc/user-guide/dvc-file-format)) is created. Check
`it-standards.csv.dvc`:

```yaml
11 changes: 6 additions & 5 deletions content/docs/command-reference/list.md
@@ -16,12 +16,13 @@ positional arguments:

## Description

DVC, by effectively replacing data files, models, directories with DVC-files
DVC, by effectively replacing data files, models, directories with `.dvc` files
(`.dvc`), hides actual locations and names. This means that you don't see data
files when you browse a <abbr>DVC repository</abbr> on Git hosting (e.g.
Github), you just see the DVC-files. This makes it hard to navigate the project
to find <abbr>data artifacts</abbr> for use with `dvc get`, `dvc import`, or
`dvc.api`.
Github), you just see the [`dvc.yaml`](/doc/user-guide/dvc-file-format) and
[`.dvc`](/doc/user-guide/dvc-file-format) files. This makes it hard to navigate
the project to find <abbr>data artifacts</abbr> for use with `dvc get`,
`dvc import`, or `dvc.api`.

`dvc list` prints a virtual view of a DVC repository, as if files and
directories [tracked by DVC](/doc/use-cases/versioning-data-and-model-files)
@@ -97,7 +98,7 @@ project's page, you will see a similar list, except that `model.pkl` will be
missing. That's because it's tracked by DVC and not visible to Git. You can find
it in the
[`train.dvc`](https://github.com/iterative/example-get-started/blob/master/train.dvc)
DVC-file (`outs` field).
`.dvc` file (`outs` field).

We can now, for example, download the model file with:
