Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvc exp run unavailable in CML #7547

Closed
datacubeR opened this issue Apr 4, 2022 · 26 comments · Fixed by iterative/cml.dev#222 or #7791
Closed

dvc exp run unavailable in CML #7547

datacubeR opened this issue Apr 4, 2022 · 26 comments · Fixed by iterative/cml.dev#222 or #7791
Assignees
Labels
A: experiments Related to dvc exp p1-important Important, aka current backlog of things to do

Comments

@datacubeR
Copy link

Is there any way to run DVC exp run in CML?

My use case includes training several experiments in a Cloud or self hosted Server and publish in the PR the results of my work.

For now when trying to do this using the setup action I get an unexpected error.

I understand that commit hides experiment results and I read that CML is not intended for experimentation but that is not quite clear in official documentation.

@casperdcl
Copy link
Contributor

casperdcl commented Apr 4, 2022

I think this is an upstream bug - #7531.

What error are you getting? Are you using setup-cml? In which case you should also use setup-dvc.

Or are you using the CML Docker images?

@casperdcl
Copy link
Contributor

commit hides experiment results and I read that CML is not intended for experimentation

Sorry I don't quite follow. How do commits hide results? CML is intended for provisioning & reporting experiments.

@casperdcl casperdcl added bug dependencies Pull requests that update a dependency file question I have a question? awaiting response we are waiting for your reply, please respond! :) and removed question I have a question? labels Apr 4, 2022
@datacubeR
Copy link
Author

commit hides experiment results and I read that CML is not intended for experimentation

Sorry I don't quite follow. How do commits hide results? CML is intended for provisioning & reporting experiments.

When I say experiments I refer to using dvc exp. Once you commit your changes dvc exp show no longer shows experiments table.

In my specific case I wanted to try the following:

name: experiments
on: [push]
jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: iterative/setup-cml@v1
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Train model
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip3 install -r requirements.txt
          #bash exp_file.sh
          dvc exp run -S train.C=0.005
          dvc exp run -S train.C=100

          echo "## Resultados del Experimento" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --no-pager --md >> report.md
          
          echo "## Parallel Plot\n" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --pcp --sort-by test_recall >> report.md

          cml send-comment report.md

In this particular case I wanted to report the results of 2 experiments with dvc exp run. I've tried also using a bash file (that you can see commented out) with the following:

dvc exp run --queue -S train.C=5
dvc exp run --queue -S train.C=30
dvc exp run --queue -S train.C=60
dvc exp run --queue -S train.C=120

dvc exp run --run-all

All of the experiments resulted in something like:

image

So my question would be: Is it allowed to run something like what I'm proposing?
Can I run my experiments and report it in the PR?

I can't remember when I read that Experimentation was meant to be done locally and only dvc repro on the CI Pipeline. Can you clarify that please?

Thanks in advance,

Alfonso

@dacbd
Copy link

dacbd commented Apr 4, 2022

My brain has stopped working for the day already, but I don't think there isn't any reason you shouldn't be able to do this. Without having tested anything like this I would blindly expand your checkout step to see if that fixes it:

- uses: actions/checkout@v3
  with:
     fetch-depth: 0

https://github.com/actions/checkout#fetch-all-history-for-all-tags-and-branches

@casperdcl casperdcl transferred this issue from iterative/cml Apr 5, 2022
@casperdcl casperdcl removed dvc labels Apr 5, 2022
@casperdcl
Copy link
Contributor

(Transferred to DVC because it looks like a config/setup issue unrelated to CML)

@casperdcl casperdcl added awaiting response we are waiting for your reply, please respond! :) and removed awaiting response we are waiting for your reply, please respond! :) dependencies Pull requests that update a dependency file labels Apr 5, 2022
@pmrowla
Copy link
Contributor

pmrowla commented Apr 5, 2022

@dacbd's suggestion should be correct, github's checkout action only fetches a single shallow commit by default, this isn't enough of the Git repo for exp run to work properly.

@dberenbaum
Copy link
Collaborator

@pmrowla Why does exp run require more than the baseline commit to be fetched?

@datacubeR
Copy link
Author

@pmrowla Why does exp run require more than the baseline commit to be fetched?

I understand you say dvc exp run depends on the rest of git history, probably because dvc exp show displays info of master metrics too. But if this is my first run of experiments dvc exp run should not depend on data of a different commit to work...

Anyways, I'll be testing this small change during the day and I'll get back to you...
Thanks,

Alfonso

@pmrowla
Copy link
Contributor

pmrowla commented Apr 5, 2022

@pmrowla Why does exp run require more than the baseline commit to be fetched?

The issue is that we use git merge in experiments and doing a merge-base operation in git requires fetch depth > 1 (the specific required depth depends on the commits being merged).

Basically, doing the merge-base requires finding a common ancestor for the commits being merged (that is not any of the merge commits themselves). For our purposes w/DVC experiments, this means that a fetch depth of 2 will usually be sufficient (so baseline commit + 1 parent), but that won't be the case 100% of the time. So if users are trying to use exp run inside github actions, the general solution is to use fetch-depth: 0 (and not use a shallow clone at all).

@dberenbaum

@dberenbaum
Copy link
Collaborator

Thanks @pmrowla. I guess even if we tried to minimize operations where merge is needed, it will still be needed sometimes, and it's probably better to document the proposed workaround above. Opened iterative/dvc.org#3416.

@casperdcl
Copy link
Contributor

casperdcl commented Apr 5, 2022

Ah right so the CML solution would be cml ci --unshallow (a bit better than fetch-depth: 0)

@daavoo daavoo added the A: experiments Related to dvc exp label Apr 6, 2022
@dberenbaum
Copy link
Collaborator

If we go with the suggestion in iterative/dvc.org#3416 (comment), we should keep this open to update the message in DVC to be more informative and helpful. Even if DVC can't tell whether it's in a CI context, it could at least suggest to fetch missing commits and maybe link to docs that could go into more depth about specific use cases like cml.

@skshetry
Copy link
Member

skshetry commented Apr 7, 2022

On the UI side, we could detect the presence of CI/GITHUB_ACTIONS/GITLAB_CI environment variables.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 7, 2022

If we get the invalid commit exception we can just do the extra check to see whether or not we are in a shallow repo (you just have to check whether or not .git/shallow exists). This problem isn't specific to CI environments, it's just more likely to show up there since github uses shallow clones in CI by default.

@dberenbaum
Copy link
Collaborator

Right, I guess the question is whether the message should be general like "don't use a shallow clone" or more CI-specific like "try fetch-depth: 0"? I think probably the former is better even if we can detect CI, since we don't want to have to cover the syntax of different git server behaviors, although we could link to docs where that level of detail could be provided.

@dacbd
Copy link

dacbd commented Apr 7, 2022

+1 for a general message w/ link to docs for more specific details/solutions.

@skshetry
Copy link
Member

skshetry commented Apr 7, 2022

Right, I guess the question is whether the message should be general like "don't use a shallow clone" or more CI-specific like "try fetch-depth: 0"? I think probably the former is better even if we can detect CI, since we don't want to have to cover the syntax of different git server behaviors, although we could link to docs where that level of detail could be provided.

I'd prefer general message with a CI specific hint and link to the docs wherever possible. I'd prefer less redirection as much as possible.

Example:

dvc exp run
ERROR: Cannot run in a shallow-cloned repository

If you are using `actions/checkout`, please set `fetch-depth: 0` so that the repo is not shallow-cloned.
See https://dvc.org/doc/user-guide/troubleshooting#shallow-clone.

@datacubeR
Copy link
Author

datacubeR commented Apr 9, 2022

Dear all,
Thanks for all of your answers. I think the spirit of good open source is havomng everyone really involved, and I really appreciate your help on trying to solve this issue.

I implemented the steps mentioned above and it worked partially. Once I run the experiments it turns out git is not recognized in the Github Machine:

image

In order to solve this issue I found a really nice action here: fregante/setup-git-user@v1. This helps the remote machine make the connection with git in order to make dvc run exp to work.

The final workflow will be like this:

name: experiments
on: [push]
jobs:
  train-model:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with: 
          fetch-depth: 0
      - uses: fregante/setup-git-user@v1
      - uses: iterative/setup-cml@v1
      - uses: iterative/setup-dvc@v1
      - uses: actions/setup-python@v2
        with:
          python-version: '3.x'
      - name: Experiment
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          pip3 install -r requirements.txt
          #bash exp_file.sh
          dvc exp run -S train.C=0.005
          dvc exp run -S train.C=100

          echo "## Resultados del Experimento" >> report.md
          dvc exp show --only-changed --drop 'assets|src' --no-pager --md >> report.md

          cml send-comment report.md

I think this could be mentioned as a user-case in the documentation because is a very common experimentation workflow, and if we can externalize experimentation to a Github machine or a Cloud Runner would be very good.

On another note, I noticed iterative/setup-dvc@v1 doesn't include yaml package which is a dependency on DVC. So you have to make sure to add it to the requirements.txt otherwise DVC will complain. Of course, unless you think it's a good idea to add it in the action.

Thanks and hope this helps DVC,

Alfonso

@dacbd
Copy link

dacbd commented Apr 9, 2022

@datacubeR, thanks for the feedback; since you are already using cml, you can take a look at running cml ci, which would solve the git user problem.

and if we can externalize experimentation to a Github machine or a Cloud Runner would be very good.

This is part of what cml runner command can help you with, running training workflows on cloud-hosted machines like ec2. If you decide to give it a try and run into any issues or have any other feedback be sure to let us know.

@dberenbaum
Copy link
Collaborator

Bumping the priority of this since it is necessary for integration with cml.

If you are using `actions/checkout`, please set `fetch-depth: 0` so that the repo is not shallow-cloned.

Is this sufficiently general? It's only applicable to GH actions AFAIK. Each git server has its own depth settings and variables, not to mention people who manually do shallow clones.

Ah right so the CML solution would be cml ci (a bit better than fetch-depth: 0)

Also, this is probably a better suggestion, but only if cml is installed.

So maybe we can generalize the message and provide specific examples like fetch-depth: 0 in the docs. I agree with the general point of @skshetry, but I don't see a way to specifically address every possible scenario in the error message.

@dberenbaum dberenbaum added the p1-important Important, aka current backlog of things to do label Apr 12, 2022
@datacubeR
Copy link
Author

If you allow me to chime in, from the user perspective I think use cases are way clearer than just random notes.

For instance, I understand these fixes are only applicable to GitHub Actions, but in my opinion there is no clear documentation on CML CI which I think won't be normally installed since docker containers or virtual machines are normally the most common way to deal with this part.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2022

Should note that cml ci does not actually unshallow the repo (but will after iterative/cml#957). The correct fix for running dvc exp run inside github actions is to set fetch-depth: 0 in the checkout action.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 13, 2022

Keeping this open on the DVC side until we address clarifying the shallow repo error in DVC UI and/or docs

@pmrowla pmrowla reopened this Apr 13, 2022
@casperdcl
Copy link
Contributor

casperdcl commented Apr 13, 2022

Yup probably just adding "did you forget to unshallow? https://error.dvc.org/unknown-object" or similar to DVC's error message would be gewd.

@dberenbaum
Copy link
Collaborator

dvc exp run succeeded in the workspace (but not in a temp dir) on a shallow clone until fa819b0:

$ dvc -V
2.9.3.dev8+g37f8010b4
$ git clone --depth=1 [email protected]:iterative/example-get-started.git
Cloning into 'example-get-started'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 35 (delta 6), reused 24 (delta 4), pack-reused 0
Receiving objects: 100% (35/35), 35.62 KiB | 17.81 MiB/s, done.
Resolving deltas: 100% (6/6), done.
$ cd example-get-started
$ dvc pull
A       data/prepared/
A       data/data.xml
A       data/features/
A       model.pkl
4 files added and 6 files fetched
$ dvc exp run
'data/data.xml.dvc' didn't change, skipping
Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
Stage 'train' didn't change, skipping
Stage 'evaluate' didn't change, skipping
$ dvc -V # Upgrade DVC version
2.9.3.dev9+gfa819b0b7
(dvc) dave@davids-air:/tmp/example-get-started 13:15:15
$ dvc exp run
ERROR: Merge failed

@pmrowla It makes sense as part of making the workspace and temp implementations more consistent, but is there any way to make at least workspace runs succeed in this scenario? It breaks a previously supported workflow for anyone using CML or other CI actions with exp run.

The issue is that we use git merge in experiments and doing a merge-base operation in git requires fetch depth > 1 (the specific required depth depends on the commits being merged).

Basically, doing the merge-base requires finding a common ancestor for the commits being merged (that is not any of the merge commits themselves). For our purposes w/DVC experiments, this means that a fetch depth of 2 will usually be sufficient (so baseline commit + 1 parent), but that won't be the case 100% of the time. So if users are trying to use exp run inside github actions, the general solution is to use fetch-depth: 0 (and not use a shallow clone at all).

Is a merge necessary to run an experiment? I thought behavior for running experiments was similar to git stash apply, which works fine in this scenario.

@pmrowla
Copy link
Contributor

pmrowla commented Apr 18, 2022

It makes sense as part of making the workspace and temp implementations more consistent, but is there any way to make at least workspace runs succeed in this scenario?

Prior to that change, workspace and temp implementations were essentially two completely different things, which made testing and maintaining both difficult, and led to them having different behavior in a lot of scenarios. I don't think it is worth the development effort required to maintain two separate exp run implementations vs requiring that users unshallow repos in CI for both the workspace and tempdir cases.

Is a merge necessary to run an experiment? I thought behavior for running experiments was similar to git stash apply, which works fine in this scenario.

It works "similar to" stash apply, not exactly like git stash apply. We have to do a merge because libgit2/pygit2 and dulwich do not have implementations for stash apply-ing arbitrary merge commits. In order for us to get the similar behavior, we have to just use regular merge to apply the merge commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp p1-important Important, aka current backlog of things to do
Projects
No open projects
Archived in project
7 participants