dvc exp run unavailable in CML #7547

datacubeR · 2022-04-04T04:14:20Z

Is there any way to run DVC exp run in CML?

My use case includes training several experiments in a Cloud or self hosted Server and publish in the PR the results of my work.

For now when trying to do this using the setup action I get an unexpected error.

I understand that commit hides experiment results and I read that CML is not intended for experimentation but that is not quite clear in official documentation.

casperdcl · 2022-04-04T08:31:50Z

I think this is an upstream bug - #7531.

What error are you getting? Are you using setup-cml? In which case you should also use setup-dvc.

Or are you using the CML Docker images?

casperdcl · 2022-04-04T08:33:54Z

commit hides experiment results and I read that CML is not intended for experimentation

Sorry I don't quite follow. How do commits hide results? CML is intended for provisioning & reporting experiments.

datacubeR · 2022-04-04T13:35:28Z

commit hides experiment results and I read that CML is not intended for experimentation

Sorry I don't quite follow. How do commits hide results? CML is intended for provisioning & reporting experiments.

When I say experiments I refer to using dvc exp. Once you commit your changes dvc exp show no longer shows experiments table.

In my specific case I wanted to try the following:

name: experiments
on: [push]
jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Train model
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip3 install -r requirements.txt
          #bash exp_file.sh
          dvc exp run -S train.C=0.005
          dvc exp run -S train.C=100

          echo "## Resultados del Experimento" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --no-pager --md >> report.md
          
          echo "## Parallel Plot\n" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --pcp --sort-by test_recall >> report.md

          cml send-comment report.md

In this particular case I wanted to report the results of 2 experiments with dvc exp run. I've tried also using a bash file (that you can see commented out) with the following:

dvc exp run --queue -S train.C=5
dvc exp run --queue -S train.C=30
dvc exp run --queue -S train.C=60
dvc exp run --queue -S train.C=120

dvc exp run --run-all

All of the experiments resulted in something like:

So my question would be: Is it allowed to run something like what I'm proposing?
Can I run my experiments and report it in the PR?

I can't remember when I read that Experimentation was meant to be done locally and only dvc repro on the CI Pipeline. Can you clarify that please?

Thanks in advance,

Alfonso

dacbd · 2022-04-04T21:46:16Z

My brain has stopped working for the day already, but I don't think there isn't any reason you shouldn't be able to do this. Without having tested anything like this I would blindly expand your checkout step to see if that fixes it:

- uses: actions/checkout@v3
  with:
     fetch-depth: 0

https://github.com/actions/checkout#fetch-all-history-for-all-tags-and-branches

casperdcl · 2022-04-05T13:29:51Z

(Transferred to DVC because it looks like a config/setup issue unrelated to CML)

pmrowla · 2022-04-05T13:41:35Z

@dacbd's suggestion should be correct, github's checkout action only fetches a single shallow commit by default, this isn't enough of the Git repo for exp run to work properly.

dberenbaum · 2022-04-05T15:12:30Z

@pmrowla Why does exp run require more than the baseline commit to be fetched?

datacubeR · 2022-04-05T15:36:11Z

@pmrowla Why does exp run require more than the baseline commit to be fetched?

I understand you say dvc exp run depends on the rest of git history, probably because dvc exp show displays info of master metrics too. But if this is my first run of experiments dvc exp run should not depend on data of a different commit to work...

Anyways, I'll be testing this small change during the day and I'll get back to you...
Thanks,

Alfonso

pmrowla · 2022-04-05T16:25:17Z

@pmrowla Why does exp run require more than the baseline commit to be fetched?

The issue is that we use git merge in experiments and doing a merge-base operation in git requires fetch depth > 1 (the specific required depth depends on the commits being merged).

Basically, doing the merge-base requires finding a common ancestor for the commits being merged (that is not any of the merge commits themselves). For our purposes w/DVC experiments, this means that a fetch depth of 2 will usually be sufficient (so baseline commit + 1 parent), but that won't be the case 100% of the time. So if users are trying to use exp run inside github actions, the general solution is to use fetch-depth: 0 (and not use a shallow clone at all).

@dberenbaum

dberenbaum · 2022-04-05T17:40:31Z

Thanks @pmrowla. I guess even if we tried to minimize operations where merge is needed, it will still be needed sometimes, and it's probably better to document the proposed workaround above. Opened iterative/dvc.org#3416.

casperdcl · 2022-04-05T19:19:39Z

Ah right so the CML solution would be cml ci --unshallow (a bit better than fetch-depth: 0)

dberenbaum · 2022-04-06T14:29:04Z

If we go with the suggestion in iterative/dvc.org#3416 (comment), we should keep this open to update the message in DVC to be more informative and helpful. Even if DVC can't tell whether it's in a CI context, it could at least suggest to fetch missing commits and maybe link to docs that could go into more depth about specific use cases like cml.

skshetry · 2022-04-07T01:42:25Z

On the UI side, we could detect the presence of CI/GITHUB_ACTIONS/GITLAB_CI environment variables.

pmrowla · 2022-04-07T03:03:27Z

If we get the invalid commit exception we can just do the extra check to see whether or not we are in a shallow repo (you just have to check whether or not .git/shallow exists). This problem isn't specific to CI environments, it's just more likely to show up there since github uses shallow clones in CI by default.

dberenbaum · 2022-04-07T12:49:27Z

Right, I guess the question is whether the message should be general like "don't use a shallow clone" or more CI-specific like "try fetch-depth: 0"? I think probably the former is better even if we can detect CI, since we don't want to have to cover the syntax of different git server behaviors, although we could link to docs where that level of detail could be provided.

dacbd · 2022-04-07T15:24:56Z

+1 for a general message w/ link to docs for more specific details/solutions.

skshetry · 2022-04-07T15:32:39Z

Right, I guess the question is whether the message should be general like "don't use a shallow clone" or more CI-specific like "try fetch-depth: 0"? I think probably the former is better even if we can detect CI, since we don't want to have to cover the syntax of different git server behaviors, although we could link to docs where that level of detail could be provided.

I'd prefer general message with a CI specific hint and link to the docs wherever possible. I'd prefer less redirection as much as possible.

Example:

dvc exp run
ERROR: Cannot run in a shallow-cloned repository

If you are using `actions/checkout`, please set `fetch-depth: 0` so that the repo is not shallow-cloned.
See https://dvc.org/doc/user-guide/troubleshooting#shallow-clone.

datacubeR · 2022-04-09T19:55:26Z

Dear all,
Thanks for all of your answers. I think the spirit of good open source is havomng everyone really involved, and I really appreciate your help on trying to solve this issue.

I implemented the steps mentioned above and it worked partially. Once I run the experiments it turns out git is not recognized in the Github Machine:

In order to solve this issue I found a really nice action here: fregante/setup-git-user@v1. This helps the remote machine make the connection with git in order to make dvc run exp to work.

The final workflow will be like this:

name: experiments
on: [push]
jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with: 
          fetch-depth: 0
      - uses: fregante/setup-git-user@v1
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Experiment
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip3 install -r requirements.txt
          #bash exp_file.sh
          dvc exp run -S train.C=0.005
          dvc exp run -S train.C=100

          echo "## Resultados del Experimento" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --no-pager --md >> report.md

          cml send-comment report.md

I think this could be mentioned as a user-case in the documentation because is a very common experimentation workflow, and if we can externalize experimentation to a Github machine or a Cloud Runner would be very good.

On another note, I noticed iterative/setup-dvc@v1 doesn't include yaml package which is a dependency on DVC. So you have to make sure to add it to the requirements.txt otherwise DVC will complain. Of course, unless you think it's a good idea to add it in the action.

Thanks and hope this helps DVC,

Alfonso

dacbd · 2022-04-09T20:48:15Z

@datacubeR, thanks for the feedback; since you are already using cml, you can take a look at running cml ci, which would solve the git user problem.

and if we can externalize experimentation to a Github machine or a Cloud Runner would be very good.

This is part of what cml runner command can help you with, running training workflows on cloud-hosted machines like ec2. If you decide to give it a try and run into any issues or have any other feedback be sure to let us know.

dberenbaum · 2022-04-12T20:18:00Z

Bumping the priority of this since it is necessary for integration with cml.

If you are using `actions/checkout`, please set `fetch-depth: 0` so that the repo is not shallow-cloned.

Is this sufficiently general? It's only applicable to GH actions AFAIK. Each git server has its own depth settings and variables, not to mention people who manually do shallow clones.

Ah right so the CML solution would be cml ci (a bit better than fetch-depth: 0)

Also, this is probably a better suggestion, but only if cml is installed.

So maybe we can generalize the message and provide specific examples like fetch-depth: 0 in the docs. I agree with the general point of @skshetry, but I don't see a way to specifically address every possible scenario in the error message.

datacubeR · 2022-04-12T20:22:49Z

If you allow me to chime in, from the user perspective I think use cases are way clearer than just random notes.

For instance, I understand these fixes are only applicable to GitHub Actions, but in my opinion there is no clear documentation on CML CI which I think won't be normally installed since docker containers or virtual machines are normally the most common way to deal with this part.

pmrowla · 2022-04-13T06:28:35Z

Should note that cml ci does not actually unshallow the repo (but will after iterative/cml#957). The correct fix for running dvc exp run inside github actions is to set fetch-depth: 0 in the checkout action.

pmrowla · 2022-04-13T07:07:40Z

Keeping this open on the DVC side until we address clarifying the shallow repo error in DVC UI and/or docs

casperdcl · 2022-04-13T12:18:07Z

Yup probably just adding "did you forget to unshallow? https://error.dvc.org/unknown-object" or similar to DVC's error message would be gewd.

dberenbaum · 2022-04-18T20:45:08Z

dvc exp run succeeded in the workspace (but not in a temp dir) on a shallow clone until fa819b0:

$ dvc -V
2.9.3.dev8+g37f8010b4
$ git clone --depth=1 [email protected]:iterative/example-get-started.git
Cloning into 'example-get-started'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 35 (delta 6), reused 24 (delta 4), pack-reused 0
Receiving objects: 100% (35/35), 35.62 KiB | 17.81 MiB/s, done.
Resolving deltas: 100% (6/6), done.
$ cd example-get-started
$ dvc pull
A       data/prepared/
A       data/data.xml
A       data/features/
A       model.pkl
4 files added and 6 files fetched
$ dvc exp run
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
$ dvc -V # Upgrade DVC version
2.9.3.dev9+gfa819b0b7
(dvc) dave@davids-air:/tmp/example-get-started 13:15:15
$ dvc exp run
ERROR: Merge failed

@pmrowla It makes sense as part of making the workspace and temp implementations more consistent, but is there any way to make at least workspace runs succeed in this scenario? It breaks a previously supported workflow for anyone using CML or other CI actions with exp run.

The issue is that we use git merge in experiments and doing a merge-base operation in git requires fetch depth > 1 (the specific required depth depends on the commits being merged).

Basically, doing the merge-base requires finding a common ancestor for the commits being merged (that is not any of the merge commits themselves). For our purposes w/DVC experiments, this means that a fetch depth of 2 will usually be sufficient (so baseline commit + 1 parent), but that won't be the case 100% of the time. So if users are trying to use exp run inside github actions, the general solution is to use fetch-depth: 0 (and not use a shallow clone at all).

Is a merge necessary to run an experiment? I thought behavior for running experiments was similar to git stash apply, which works fine in this scenario.

pmrowla · 2022-04-18T23:34:09Z

It makes sense as part of making the workspace and temp implementations more consistent, but is there any way to make at least workspace runs succeed in this scenario?

Prior to that change, workspace and temp implementations were essentially two completely different things, which made testing and maintaining both difficult, and led to them having different behavior in a lot of scenarios. I don't think it is worth the development effort required to maintain two separate exp run implementations vs requiring that users unshallow repos in CI for both the workspace and tempdir cases.

Is a merge necessary to run an experiment? I thought behavior for running experiments was similar to git stash apply, which works fine in this scenario.

It works "similar to" stash apply, not exactly like git stash apply. We have to do a merge because libgit2/pygit2 and dulwich do not have implementations for stash apply-ing arbitrary merge commits. In order for us to get the similar behavior, we have to just use regular merge to apply the merge commit.

casperdcl added bug dependencies Pull requests that update a dependency file question I have a question? awaiting response we are waiting for your reply, please respond! :) and removed question I have a question? labels Apr 4, 2022

casperdcl transferred this issue from iterative/cml Apr 5, 2022

casperdcl removed dvc labels Apr 5, 2022

casperdcl added awaiting response we are waiting for your reply, please respond! :) and removed awaiting response we are waiting for your reply, please respond! :) dependencies Pull requests that update a dependency file labels Apr 5, 2022

dberenbaum mentioned this issue Apr 5, 2022

exp: CI runs require unshallow clone (fetch-depth: >1) iterative/dvc.org#3416

Closed

daavoo added the A: experiments Related to dvc exp label Apr 6, 2022

dberenbaum added the p1-important Important, aka current backlog of things to do label Apr 12, 2022

dacbd mentioned this issue Apr 13, 2022

Add example usage cases for cml ci iterative/cml.dev#221

Open

This was referenced Apr 13, 2022

ci: add --unshallow iterative/cml#957

Merged

ref/ci: --unshallow iterative/cml.dev#222

Merged

casperdcl closed this as completed in iterative/cml.dev#222 Apr 13, 2022

pmrowla reopened this Apr 13, 2022

pmrowla self-assigned this May 20, 2022

pmrowla mentioned this issue May 20, 2022

exp: output troubleshooting link on shallow merge failure #7791

Merged

2 tasks

pmrowla closed this as completed in #7791 Jun 2, 2022

dberenbaum mentioned this issue Jan 19, 2023

exp run: make it work more easily in ci #8844

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dvc exp run unavailable in CML #7547

dvc exp run unavailable in CML #7547

datacubeR commented Apr 4, 2022

casperdcl commented Apr 4, 2022 •

edited

Loading

casperdcl commented Apr 4, 2022

datacubeR commented Apr 4, 2022

dacbd commented Apr 4, 2022

casperdcl commented Apr 5, 2022

pmrowla commented Apr 5, 2022

dberenbaum commented Apr 5, 2022

datacubeR commented Apr 5, 2022

pmrowla commented Apr 5, 2022 •

edited

Loading

dberenbaum commented Apr 5, 2022

casperdcl commented Apr 5, 2022 •

edited

Loading

dberenbaum commented Apr 6, 2022

skshetry commented Apr 7, 2022

pmrowla commented Apr 7, 2022

dberenbaum commented Apr 7, 2022

dacbd commented Apr 7, 2022

skshetry commented Apr 7, 2022

datacubeR commented Apr 9, 2022 •

edited

Loading

dacbd commented Apr 9, 2022 •

edited by casperdcl

Loading

dberenbaum commented Apr 12, 2022

datacubeR commented Apr 12, 2022

pmrowla commented Apr 13, 2022 •

edited

Loading

pmrowla commented Apr 13, 2022

casperdcl commented Apr 13, 2022 •

edited

Loading

dberenbaum commented Apr 18, 2022

pmrowla commented Apr 18, 2022

dvc exp run unavailable in CML #7547

dvc exp run unavailable in CML #7547

Comments

datacubeR commented Apr 4, 2022

casperdcl commented Apr 4, 2022 • edited Loading

casperdcl commented Apr 4, 2022

datacubeR commented Apr 4, 2022

dacbd commented Apr 4, 2022

casperdcl commented Apr 5, 2022

pmrowla commented Apr 5, 2022

dberenbaum commented Apr 5, 2022

datacubeR commented Apr 5, 2022

pmrowla commented Apr 5, 2022 • edited Loading

dberenbaum commented Apr 5, 2022

casperdcl commented Apr 5, 2022 • edited Loading

dberenbaum commented Apr 6, 2022

skshetry commented Apr 7, 2022

pmrowla commented Apr 7, 2022

dberenbaum commented Apr 7, 2022

dacbd commented Apr 7, 2022

skshetry commented Apr 7, 2022

datacubeR commented Apr 9, 2022 • edited Loading

dacbd commented Apr 9, 2022 • edited by casperdcl Loading

dberenbaum commented Apr 12, 2022

datacubeR commented Apr 12, 2022

pmrowla commented Apr 13, 2022 • edited Loading

pmrowla commented Apr 13, 2022

casperdcl commented Apr 13, 2022 • edited Loading

dberenbaum commented Apr 18, 2022

pmrowla commented Apr 18, 2022

casperdcl commented Apr 4, 2022 •

edited

Loading

pmrowla commented Apr 5, 2022 •

edited

Loading

casperdcl commented Apr 5, 2022 •

edited

Loading

datacubeR commented Apr 9, 2022 •

edited

Loading

dacbd commented Apr 9, 2022 •

edited by casperdcl

Loading

pmrowla commented Apr 13, 2022 •

edited

Loading

casperdcl commented Apr 13, 2022 •

edited

Loading