DVC pipeline list only shows 1 pipeline #3460

Closed
guillecarc opened this issue Mar 9, 2020 · 10 comments
Labels
discussion requires active participation to reach a conclusion p2-medium Medium priority, should be done, but less important


@guillecarc

dvc pipeline list only shows 1 pipeline as a result even though there are several stages that start from the same previous stage.

The use case that helped me discover this is that I am creating several models. Each model feeds from a baseline pipeline that loads and preprocesses data.

Platform and method of installation: pkg Mac

DVC version: 0.87.0
Python version: 3.7.5
Platform: Darwin-18.7.0-x86_64-i386-64bit
Binary: True
Package: osxpkg
Filesystem type (workspace): ('apfs', '/dev/disk1s1')

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Mar 9, 2020
@pared
Contributor

pared commented Mar 9, 2020

Reproduction script:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo
set -ex

git init --quiet
dvc init -q

echo data >> data
dvc add data -q


dvc run -d data -o preprocessed -f 1.dvc "cat data > preprocessed"
dvc run -d preprocessed -o res1 -f 2.dvc "cat preprocessed > res1"
dvc run -d preprocessed -o res2 -f 3.dvc "cat preprocessed > res2"

dvc pipeline list

Will show:

+ dvc pipeline list
1.dvc
2.dvc
3.dvc
data.dvc

While one could expect two lists:

1: data -> 1.dvc -> 2.dvc
2: data -> 1.dvc -> 3.dvc
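
To make that expectation concrete: the two lists are the root-to-leaf paths of the stage graph. Below is a minimal Python sketch, assuming the graph is written out by hand as a dict from each stage to the stages that consume its output. The `children` dict and the `root_to_leaf_paths` helper are hypothetical names for illustration only, not DVC internals.

```python
# Hypothetical hand-written stage graph for the reproduction script above:
# each key is a stage file, each value lists the stages that depend on it.
children = {
    "data.dvc": ["1.dvc"],
    "1.dvc": ["2.dvc", "3.dvc"],
    "2.dvc": [],
    "3.dvc": [],
}


def root_to_leaf_paths(children):
    """Return every path from a stage with no parents to a stage with no children."""
    has_parent = {child for kids in children.values() for child in kids}
    roots = [stage for stage in children if stage not in has_parent]

    paths = []

    def walk(stage, prefix):
        prefix = prefix + [stage]
        if not children[stage]:
            # Reached a leaf: record the complete path.
            paths.append(prefix)
            return
        for child in children[stage]:
            walk(child, prefix)

    for root in roots:
        walk(root, [])
    return paths


paths = root_to_leaf_paths(children)
for number, path in enumerate(paths, start=1):
    print(f"{number}: " + " -> ".join(path))
```

With this graph the sketch yields two paths, one ending in 2.dvc and one ending in 3.dvc, which is the grouping expected here rather than the single flat list.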

@drorata

drorata commented Mar 10, 2020

Isn't it related to #2392?

@efiop
Contributor

efiop commented Mar 11, 2020

> While one could expect two lists:

@pared But that is the same pipeline. dvc pipeline list just dumps the pipeline stages, not really each possible path from root to leaves.
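
One way to read "the same pipeline" is as a connected component of the stage graph: every stage reachable from another through any chain of dependencies, ignoring direction. The following is a rough Python sketch of that reading, not DVC's actual implementation. The `edges` list and the `connected_components` helper are hypothetical and written by hand for the reproduction script.

```python
from collections import defaultdict, deque

# Hypothetical dependency edges (parent stage, child stage) for the reproduction script.
edges = [("data.dvc", "1.dvc"), ("1.dvc", "2.dvc"), ("1.dvc", "3.dvc")]


def connected_components(edges):
    """Group stages that are linked by any chain of edges, ignoring direction."""
    neighbours = defaultdict(set)
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)

    seen = set()
    components = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            stage = queue.popleft()
            component.append(stage)
            for other in neighbours[stage]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    return components


components = connected_components(edges)
print(components)
```

Under this reading the four stage files form a single component, which matches the one flat list printed by dvc pipeline list above.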

@pared pared added the p2-medium Medium priority, should be done, but less important label Mar 11, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Mar 11, 2020
@pared pared added the discussion requires active participation to reach a conclusion label Mar 11, 2020
@pared
Contributor

pared commented Mar 11, 2020

@guillecarc take note of @efiop's comment.
Still, looking at the behaviour of show and list, it seems that for
show - the pipeline is the target stage with its predecessors
list - (in the given example) the pipeline is the root stage with all "child" stages.

It seems to me we should either unify the behaviour or make this discrepancy clear.
Would you agree?
@efiop @jorgeorpinel?

@drorata

drorata commented Mar 12, 2020

If I remember correctly, the problem (also discussed in #2392) is that show returns the whole connected component of the stage and not only "up to the stage".

@pared
Contributor

pared commented Mar 12, 2020

@drorata As I recall, yes, that was the case back then. But during development we made some changes that affected how the pipeline commands are handled.
For example:

#!/bin/bash

rm -rf repo
mkdir repo

pushd repo

git init --quiet && dvc init -q

echo data >> data

dvc add data -q

dvc run -q -d data -o data_train "echo data_train >> data_train"
dvc run -q -d data -o data_test "echo data_test >> data_test"
dvc run -q -d data_test -d data_train -o result "echo result >> result"

dvc run -q -d data -o branch "echo branch >> branch"

When we run dvc pipeline show result.dvc we get:

                 +----------+                    
                 | data.dvc |                    
                 +----------+*                   
               ***            ***                
             **                  **              
           **                      **            
+---------------+            +----------------+  
| data_test.dvc |            | data_train.dvc |  
+---------------+            +----------------+  
               ***            ***                
                  **        **                   
                    **    **                     
                +------------+                   
                | result.dvc |                   
                +------------+                   

When we run dvc pipeline show --ascii data_test.dvc:

  +----------+     
  | data.dvc |     
  +----------+     
        *          
        *          
        *          
+---------------+  
| data_test.dvc |  
+---------------+  

So now, show takes the target and its predecessors.
I think we should talk through how we handle the pipeline subcommands.
As it stands, we cannot get the full picture of our DAG without analysing the stage files or running pipeline show several times.
In the given example we would have to:

  1. Run dvc pipeline list to see all interconnected stages
  2. Run dvc pipeline show at least 2 times (for result.dvc and branch.dvc) to get an idea of what the whole project looks like. Of course, for someone seeing the project for the first time, it will be much harder.
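
The "target and its predecessors" rule, and why two show calls are needed here, can be sketched in Python. This is an illustration under stated assumptions, not DVC's code: the `parents` dict is a hand-written copy of the graph from the script above, and `target_with_predecessors` and `leaf_stages` are hypothetical helper names.

```python
# Hypothetical stage graph for the second script: each stage maps to the
# stages it depends on (its direct parents).
parents = {
    "data.dvc": [],
    "data_train.dvc": ["data.dvc"],
    "data_test.dvc": ["data.dvc"],
    "result.dvc": ["data_test.dvc", "data_train.dvc"],
    "branch.dvc": ["data.dvc"],
}


def target_with_predecessors(parents, target):
    """Return the target plus every stage it depends on, directly or indirectly."""
    collected = set()
    stack = [target]
    while stack:
        stage = stack.pop()
        if stage in collected:
            continue
        collected.add(stage)
        stack.extend(parents[stage])
    return collected


def leaf_stages(parents):
    """Return stages that no other stage depends on."""
    used_as_parent = {p for deps in parents.values() for p in deps}
    return sorted(stage for stage in parents if stage not in used_as_parent)


# Showing every leaf target is enough to cover the whole graph.
leaves = leaf_stages(parents)
covered = set()
for leaf in leaves:
    covered |= target_with_predecessors(parents, leaf)
print(leaves)
print(covered == set(parents))
```

In this sketch the leaves are branch.dvc and result.dvc, and the union of their predecessor sets covers every stage, which is why showing those two targets is the minimum for this example. Showing data_test.dvc alone collects only data_test.dvc and data.dvc, as in the second diagram.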

@jorgeorpinel
Contributor

It'd be great to review the 2 cmd refs in the docs repo. Maybe this issue can be transferred there or a new one opened. Thanks

@drorata

drorata commented Mar 12, 2020

I cannot say whether this is a documentation discussion or a design one, but what I can say is that, unfortunately, I haven't updated DVC in the main project where I use it (I'm too worried that something will break). So maybe the behavior did indeed change. @pared when did it change?

Otherwise, I would say that there should be the following options:

  1. Given a .dvc stage show all:
    1. preceding stages
    2. all stages in the connected component of the DAG to which the stage belongs
  2. or, all connected components of the DAG.

Does it make sense?

@pared
Contributor

pared commented Mar 13, 2020

@drorata just to note:
we are talking here about stage collection for the pipeline command, which has only "visual" value.
Collecting stages and building the graph for reproduction happens elsewhere, and that code has not been touched for quite some time.

What version do you use in your main repo? If newer versions were to introduce some problems, you could probably roll them back using git. As long as the cache is not removed and you use git, you should be fine.

The change that brought in the different show rules landed in version 0.82.9.

@skshetry
Member

skshetry commented Dec 9, 2020

Closing, as pipeline list no longer exists and dvc dag now replaces dvc pipeline show.

@skshetry skshetry closed this as completed Dec 9, 2020