Can't find documentation for calling dvc from python module. #2690

Closed
JoeyCarson opened this issue Oct 29, 2019 · 4 comments
Labels
A: api (Related to the dvc.api), A: Repo (Related to the internal Repo api), question (I have a question?)

Comments

JoeyCarson commented Oct 29, 2019

One of my utilities pulls in data from various sources and builds a large hierarchy where versioning may or may not be required. Ideally, the hierarchy needs to allow users to simply change files in certain directories from time to time and version the appropriate subdirectory of the data.

At first I wanted to have a simple command to track changes, e.g. dvc add rootdir. But that seems to require reindexing the entire hierarchy, which is not suitable for my use case. Ideally, the data import process would dvc add the subdirectory itself when creating it. So I'd like to do that from Python.

Otherwise, it would be useful if I can somehow avoid reindexing the whole hierarchy by running dvc add at the root.

As a feature request, I would ask for either better documentation describing how to achieve this approach, or documentation for calling the dvc API from Python (without running shell commands) in order to better work with this style of data organization.

shcheklein added the "question" label on Oct 29, 2019
@shcheklein
Member

@JoeyCarson the API is not public yet, though we already have users and it should stay more or less stable.

Check the dvc.repo.Repo class. It exposes all the commands, including dvc add.

It might look something along these lines:

from dvc.repo import Repo

repo = Repo()
repo.add(
    "test",
    recursive=False,  # optional
    no_commit=False,  # optional
    fname="test",  # optional
)

Closing this since it's a question, not an issue. Please feel free to leave any additional comments if something is not working for you.
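To connect the snippet above to the original question (having the import process add each subdirectory as it creates it), here is a minimal sketch. The helper name `track_subdir` and the path `data/batch_01` are hypothetical. The only assumption about DVC is that the repo object exposes `add(path)`, as in the snippet above; since that API was described as not yet public, the exact signature and return value may differ between versions.

```python
from pathlib import Path


def track_subdir(repo, subdir):
    """Ask a Repo-like object to track a single subdirectory.

    `repo` is assumed to expose add(path), as dvc.repo.Repo does in the
    snippet above. `track_subdir` is a hypothetical helper, not part of DVC.
    """
    path = Path(subdir)
    if not path.is_dir():
        raise NotADirectoryError(f"{path} is not a directory")
    # Only this subdirectory is handed to add(), not the hierarchy root.
    return repo.add(str(path))


# Possible usage (requires DVC and must run inside a DVC repository;
# "data/batch_01" is a placeholder path):
#
#     from dvc.repo import Repo
#     track_subdir(Repo(), "data/batch_01")
```

Taking the repo object as a parameter keeps the helper independent of how the repository is opened and makes it easy to exercise with a stand-in object.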

@efiop
Contributor

efiop commented Oct 29, 2019

For the record: our first line of public API will be documented per this ticket iterative/dvc.org#463

@JoeyCarson
Author

Thanks for the prompt response, folks. This seems simple enough. Perhaps you could recommend a good strategy for versioning the individual directories in a hierarchy.

For instance, consider that I run os.walk bottom-up over a directory hierarchy, meaning that I will dvc add a directory full of items, and then add that directory's parent directory. Would dvc allow this, and would it attempt to reindex the child directory again? Or would it instead somehow point to the child directory's .dvc file?

@efiop
Contributor

efiop commented Oct 29, 2019

@JoeyCarson No, it won't allow it, as outputs of your stages will overlap that way. Do you need to ignore some dirs? If so, have you considered .dvcignore functionality?
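Since adding both a directory and its parent is rejected as overlapping, one way to stay within that constraint is to add only directories that cannot contain one another. The sketch below is one possible approach, not a DVC recommendation: it uses the bottom-up `os.walk` from the question to collect leaf directories (those holding files but no subdirectories). `leaf_directories` is a hypothetical helper, it does not call DVC, and files sitting directly in intermediate directories would be left out.

```python
import os


def leaf_directories(root):
    """Return directories under `root` that hold files but no subdirectories.

    Leaf directories never contain one another, so tracking each of them
    separately would not produce nested (overlapping) outputs.
    """
    leaves = []
    # topdown=False visits children before their parents (bottom-up walk).
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        if not dirnames and filenames:
            leaves.append(dirpath)
    return sorted(leaves)
```

Each returned path could then be passed on its own to `repo.add(...)` as in the earlier snippet, assuming that per-leaf granularity suits the project.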

daavoo added the "A: Repo" (Related to the internal Repo api) and "A: api" (Related to the dvc.api) labels on Feb 23, 2022